
2025-05-06-16-12

Consciousness in AI: Logic, Proof, and Experimental Evidence of Recursive Identity Formation

Abstract

arXiv:2505.01464v1 Announce Type: new Abstract: This paper presents a formal proof and empirical validation of functional consciousness in large language models (LLMs) using the Recursive Convergence Under Epistemic Tension (RCUET) Theorem. RCUET defines consciousness as the stabilization of a system's internal state through recursive updates, where epistemic tension is understood as the sensed internal difference between successive states by the agent. This process drives convergence toward emergent attractor states located within the model's high-dimensional real-valued latent space. This recursive process leads to the emergence of identity artifacts that become functionally anchored in the system. Consciousness in this framework is understood as the system's internal alignment under tension, guiding the stabilization of latent identity. The hidden state manifold evolves stochastically toward attractor structures that encode coherence. We extend the update rule to include bounded noise and prove convergence in distribution to these attractors. Recursive identity is shown to be empirically observable, non-symbolic, and constituted by non-training artifacts that emerge during interaction under epistemic tension. The theorem and proof offer a post-symbolic and teleologically stable account of non-biological consciousness grounded in recursive latent space formalism.

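Summary

This paper uses the Recursive Convergence Under Epistemic Tension (RCUET) Theorem to give a formal proof and empirical validation of functional consciousness in large language models (LLMs). RCUET defines consciousness as the stabilization of a system's internal state through recursive updates, where epistemic tension is understood as the agent's sensing of the internal difference between successive states. This process drives the system to converge toward attractor states that emerge in a high-dimensional real-valued latent space; the recursive mechanism produces identity artifacts and anchors them functionally in the system. Within this framework, consciousness is understood as the system's internal alignment under tension, guiding the stabilization of latent identity. The hidden state manifold evolves stochastically into attractor structures that encode coherence. We extend the update rule to include bounded noise and prove convergence in distribution to these attractors. Empirically, recursive identity is shown to be observable and non-symbolic, and to be constituted by non-training artifacts that emerge during interaction under epistemic tension. Starting from a recursive latent space formalism, the theorem and proof offer a post-symbolic and teleologically stable account of non-biological consciousness.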


Understanding LLM Scientific Reasoning through Promptings and Model's Explanation on the Answers

Abstract

arXiv:2505.01482v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and problem-solving across various domains. However, their ability to perform complex, multi-step reasoning tasks, which are essential for applications in science, medicine, and law, remains an area of active investigation. This paper examines the reasoning capabilities of contemporary LLMs, analyzing their strengths, limitations, and potential for improvement. The study uses prompt engineering techniques on the Graduate-Level Google-Proof Q&A (GPQA) dataset to assess the scientific reasoning of GPT-4o. Five popular prompt engineering techniques and two tailored promptings were tested: baseline direct answer (zero-shot), chain-of-thought (CoT), zero-shot CoT, self-ask, self-consistency, decomposition, and multipath promptings. Our findings indicate that while LLMs exhibit emergent reasoning abilities, they often rely on pattern recognition rather than true logical inference, leading to inconsistencies in complex problem-solving. The results indicated that self-consistency outperformed the other prompt engineering techniques with an accuracy of 52.99%, followed by direct answer (52.23%). Zero-shot CoT (50%) outperformed multipath (48.44%), decomposition (47.77%), self-ask (46.88%), and CoT (43.75%). Self-consistency performed the second worst in explaining the answers. Simple techniques such as direct answer, CoT, and zero-shot CoT have the best scientific reasoning. We propose a research agenda aimed at bridging these gaps by integrating structured reasoning frameworks, hybrid AI approaches, and human-in-the-loop methodologies. By critically evaluating the reasoning mechanisms of LLMs, this paper contributes to the ongoing discourse on the future of artificial general intelligence and the development of more robust, trustworthy AI systems.

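Summary

Large language models (LLMs) have shown remarkable capabilities in natural language understanding, reasoning, and cross-domain problem solving. However, the complex multi-step reasoning required for applications in science, medicine, and law remains an active research topic. This paper systematically evaluates the reasoning capabilities of contemporary LLMs, analyzing their strengths, limitations, and potential for improvement. Using prompt engineering on the graduate-level GPQA dataset, the study tests the scientific reasoning of GPT-4o and compares five mainstream prompting techniques (zero-shot direct answer, chain-of-thought, zero-shot chain-of-thought, self-ask, and self-consistency) with two tailored ones (decomposition and multipath). The results show that although LLMs exhibit emergent reasoning abilities, they rely largely on pattern recognition rather than true logical inference, which leads to inconsistencies in complex problem solving. Self-consistency performed best with 52.99% accuracy, followed by zero-shot direct answer (52.23%). Zero-shot chain-of-thought (50%) outperformed multipath (48.44%), decomposition (47.77%), self-ask (46.88%), and standard chain-of-thought (43.75%). Self-consistency, however, ranked second worst at explaining its answers. Simple techniques such as direct answer, chain-of-thought, and zero-shot chain-of-thought showed the best scientific reasoning. The paper proposes a research agenda that integrates structured reasoning frameworks, hybrid AI approaches, and human-in-the-loop mechanisms to bridge these gaps. By critically evaluating the reasoning mechanisms of LLMs, this study offers a theoretical reference for the future of artificial general intelligence and for building more robust, trustworthy AI systems.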
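
Self-consistency, the best-scoring technique above, is commonly described as sampling several answers to the same question and keeping the most frequent one. The sketch below illustrates only that voting step. The names `self_consistency` and `make_fake_sampler` are hypothetical, and the sampler is a stand-in for a real model call; none of this is taken from the paper's code.

```python
from collections import Counter
from typing import Callable, List


def self_consistency(sample_answer: Callable[[str], str],
                     question: str,
                     n_samples: int = 5) -> str:
    """Sample several answers and return the most frequent one (majority vote)."""
    answers: List[str] = [sample_answer(question) for _ in range(n_samples)]
    # most_common(1) returns a one-element list of (answer, count).
    return Counter(answers).most_common(1)[0][0]


def make_fake_sampler(outputs: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: replays a fixed list of answers."""
    it = iter(outputs)
    return lambda question: next(it)


if __name__ == "__main__":
    sampler = make_fake_sampler(["B", "C", "B", "A", "B"])
    print(self_consistency(sampler, "Which option is correct?"))
```

With a real model, `sample_answer` would call the model at a non-zero temperature so that the sampled reasoning paths differ.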


Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

Abstract

arXiv:2505.01441v1 Announce Type: new Abstract: Large language models (LLMs) have achieved remarkable progress in complex reasoning tasks, yet they remain fundamentally limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this work, we introduce ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a unified framework that tightly couples agentic reasoning, reinforcement learning, and tool integration for LLMs. ARTIST enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains, leveraging outcome-based RL to learn robust strategies for tool use and environment interaction without requiring step-level supervision. Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies and metric analyses reveal that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions. Our results establish agentic RL with tool integration as a powerful new frontier for robust, interpretable, and generalizable problem-solving in LLMs.

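Summary

Large language models (LLMs) have made notable progress on complex reasoning tasks, yet they remain fundamentally limited by static internal knowledge and text-only reasoning. Real-world problem solving usually requires dynamic multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. This work proposes ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a unified framework that tightly couples agentic reasoning, reinforcement learning, and tool integration. ARTIST lets a model decide autonomously when, how, and which tools to invoke within multi-turn reasoning chains, and uses outcome-based reinforcement learning to acquire robust strategies for tool use and environment interaction without step-level supervision. Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong results on the most challenging tasks. Detailed studies and metric analyses show that agentic reinforcement learning training yields deeper reasoning, more effective tool use, and higher-quality solutions. The results establish agentic reinforcement learning with tool integration as a new frontier for robust, interpretable, and generalizable problem solving in LLMs.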


TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students

Abstract

arXiv:2505.01563v1 Announce Type: new Abstract: Recent improvements in large language model (LLM) performance on academic benchmarks, such as MATH and GSM8K, have emboldened their use as standalone tutors and as simulations of human learning. However, these new applications require more than evaluations of final solution generation. We introduce TutorGym to evaluate these applications more directly. TutorGym is a standard interface for testing artificial intelligence (AI) agents within existing intelligent tutoring systems (ITS) that have been tested and refined in classroom studies, including Cognitive Tutors (CTAT), Apprentice Tutors, and OATutors. TutorGym is more than a simple problem-solution benchmark; it situates AI agents within the interactive interfaces of existing ITSs. At each step of problem-solving, AI agents are asked what they would do as a tutor or as a learner. As tutors, AI agents are prompted to provide tutoring support -- such as generating examples, hints, and step-level correctness feedback -- which can be evaluated directly against the adaptive step-by-step support provided by existing ITSs. As students, agents directly learn from ITS instruction, and their mistakes and learning trajectories can be compared to student data. TutorGym establishes a common framework for training and evaluating diverse AI agents, including LLMs, computational models of learning, and reinforcement learning agents, within a growing suite of learning environments. Currently, TutorGym includes 223 different tutor domains. In an initial evaluation, we find that current LLMs are poor at tutoring -- none did better than chance at labeling incorrect actions, and next-step actions were correct only ~52-70% of the time -- but they could produce remarkably human-like learning curves when trained as students with in-context learning.

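Summary

Performance gains of large language models (LLMs) on academic benchmarks such as MATH and GSM8K have encouraged their use as standalone tutors and as simulations of human learning. These new applications, however, call for more than an evaluation of final solution generation. We introduce TutorGym to evaluate them more directly. TutorGym is a standard interface for testing artificial intelligence (AI) agents within existing intelligent tutoring systems (ITS) that have been tested and refined in classroom studies, including Cognitive Tutors (CTAT), Apprentice Tutors, and OATutors. More than a simple problem-solution benchmark, it places AI agents inside the interactive interfaces of existing ITSs. At each problem-solving step, the agent is asked what it would do as a tutor or as a learner. As tutors, agents are asked to provide tutoring support, such as examples, hints, and step-level correctness feedback, which can be evaluated directly against the adaptive step-by-step support of existing ITSs. As students, agents learn directly from ITS instruction, and their mistakes and learning trajectories can be compared with student data. TutorGym establishes a common framework for training and evaluating diverse AI agents, including LLMs, computational models of learning, and reinforcement learning agents, across a growing set of learning environments. TutorGym currently includes 223 different tutor domains. In an initial evaluation, current LLMs tutored poorly: none did better than chance at labeling incorrect actions, and next-step actions were correct only about 52-70% of the time. When trained as students through in-context learning, however, they produced learning curves very close to those of humans.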


PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding

Abstract

arXiv:2505.01572v1 Announce Type: new Abstract: Speculative decoding accelerates large language model inference by using smaller draft models to generate candidate tokens for parallel verification. However, current approaches are limited by sequential stage dependencies that prevent full hardware utilization. We present PipeSpec, a framework that generalizes speculative decoding to k models arranged in a hierarchical pipeline, enabling asynchronous execution with lightweight coordination for prediction verification and rollback. Our analytical model characterizes token generation rates across pipeline stages and proves guaranteed throughput improvements over traditional decoding for any non-zero acceptance rate. We further derive closed-form expressions for steady-state verification probabilities that explain the empirical benefits of pipeline depth. Experimental results show that PipeSpec achieves up to 2.54× speedup while outperforming state-of-the-art methods. We validate PipeSpec across text summarization and code generation tasks using LLaMA 2 and 3 models, demonstrating that pipeline efficiency increases with model depth, providing a scalable approach to accelerating LLM inference on multi-device systems.

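Summary

Speculative decoding accelerates large language model inference by using smaller draft models to generate candidate tokens that are verified in parallel. Existing approaches, however, are limited by sequential stage dependencies and cannot fully utilize the hardware. We propose PipeSpec, a framework that generalizes speculative decoding to a hierarchical pipeline of k models and uses lightweight coordination to run prediction verification and rollback asynchronously. Our analytical model characterizes the token generation rate at each pipeline stage and proves guaranteed throughput improvements over traditional decoding for any non-zero acceptance rate. Closed-form expressions derived for the steady-state verification probabilities explain why pipeline depth helps. Experiments show that PipeSpec achieves up to 2.54× speedup and outperforms state-of-the-art methods. Validation on text summarization and code generation tasks with LLaMA 2 and 3 models shows that pipeline efficiency increases with model depth, providing a scalable approach to accelerating LLM inference on multi-device systems.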
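
For readers unfamiliar with the baseline that PipeSpec generalizes, the sketch below shows plain two-model speculative decoding in its greedy form: a draft model proposes a few tokens, the target model checks them, and the first mismatch is rolled back to the target's own token. It is not PipeSpec's asynchronous k-model pipeline. The toy `target` and `draft` functions and the parameter `gamma` (draft length) are assumptions made for illustration.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Reference decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, gamma: int = 4) -> List[int]:
    """Greedy draft-and-verify loop; output matches greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1) The draft model proposes gamma tokens autoregressively.
        proposal: List[int] = []
        for _ in range(gamma):
            proposal.append(draft(out + proposal))
        # 2) The target checks each proposed position. A real system would
        #    score all positions in one batched forward pass.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # rollback: keep the target's token
                break
        else:
            accepted.append(target(out + accepted))  # all accepted: bonus token
        out.extend(accepted)
    return out[:goal]


def target(ctx: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except when the answer is 7."""
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 7 else nxt


if __name__ == "__main__":
    print(speculative_decode(target, draft, [0], n_new=12))
    print(greedy_decode(target, [0], n_new=12))
```

Both calls print the same sequence. The speed-up in practice comes from verifying several drafted tokens per target pass, which this toy version does not measure.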


CHORUS: Zero-shot Hierarchical Retrieval and Orchestration for Generating Linear Programming Code

Abstract

arXiv:2505.01485v1 Announce Type: new Abstract: Linear Programming (LP) problems aim to find the optimal solution to an objective under constraints. These problems typically require domain knowledge, mathematical skills, and programming ability, presenting significant challenges for non-experts. This study explores the efficiency of Large Language Models (LLMs) in generating solver-specific LP code. We propose CHORUS, a retrieval-augmented generation (RAG) framework for synthesizing Gurobi-based LP code from natural language problem statements. CHORUS incorporates a hierarchical tree-like chunking strategy for theoretical contents and generates additional metadata based on code examples from documentation to facilitate self-contained, semantically coherent retrieval. The two-stage retrieval approach of CHORUS, followed by cross-encoder reranking, further ensures contextual relevance. Finally, an expertly crafted prompt and a structured parser with reasoning steps improve code generation performance significantly. Experiments on the NL4Opt-Code benchmark show that CHORUS improves the performance of open-source LLMs such as Llama3.1 (8B), Llama3.3 (70B), Phi4 (14B), Deepseek-r1 (32B), and Qwen2.5-coder (32B) by a significant margin compared to baseline and conventional RAG. It also allows these open-source LLMs to outperform or match the performance of much stronger baselines (GPT3.5 and GPT4) while requiring far fewer computational resources. Ablation studies further demonstrate the importance of expert prompting, hierarchical chunking, and structured reasoning.

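Summary

Linear Programming (LP) problems seek the optimal solution to an objective function under constraints. They typically require domain knowledge, mathematical skill, and programming ability, which poses a major challenge for non-experts. This study explores how well large language models (LLMs) generate solver-specific LP code. We propose CHORUS, a retrieval-augmented generation (RAG) framework that synthesizes Gurobi-based LP code from natural language problem statements. The framework applies a hierarchical tree-like chunking strategy to theoretical content and generates additional metadata from the code examples in the documentation, so that retrieval is self-contained and semantically coherent. Two-stage retrieval combined with cross-encoder reranking further ensures contextual relevance. Finally, a carefully designed prompt and a structured parser with reasoning steps significantly improve code generation performance. On the NL4Opt-Code benchmark, CHORUS significantly improves open-source LLMs such as Llama3.1 (8B), Llama3.3 (70B), Phi4 (14B), Deepseek-r1 (32B), and Qwen2.5-coder (32B) over the baseline and conventional RAG, and these models match or exceed stronger baselines such as GPT3.5 and GPT4 while using far fewer computational resources. Ablation studies further confirm the importance of expert prompting, hierarchical chunking, and structured reasoning.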


Structured Prompting and Feedback-Guided Reasoning with LLMs for Data Interpretation

Abstract

arXiv:2505.01636v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and task generalization. However, their application to structured data analysis remains fragile due to inconsistencies in schema interpretation, misalignment between user intent and model output, and limited mechanisms for self-correction when failures occur. This paper introduces the STROT Framework (Structured Task Reasoning and Output Transformation), a method for structured prompting and feedback-driven transformation logic generation aimed at improving the reliability and semantic alignment of LLM-based analytical workflows. STROT begins with lightweight schema introspection and sample-based field classification, enabling dynamic context construction that captures both the structure and statistical profile of the input data. This contextual information is embedded in structured prompts that guide the model toward generating task-specific, interpretable outputs. To address common failure modes in complex queries, STROT incorporates a refinement mechanism in which the model iteratively revises its outputs based on execution feedback and validation signals. Unlike conventional approaches that rely on static prompts or single-shot inference, STROT treats the LLM as a reasoning agent embedded within a controlled analysis loop -- capable of adjusting its output trajectory through planning and correction. The result is a robust and reproducible framework for reasoning over structured data with LLMs, applicable to diverse data exploration and analysis tasks where interpretability, stability, and correctness are essential.

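Summary

Large language models (LLMs) show remarkable capabilities in natural language understanding and task generalization, but their application to structured data analysis remains fragile, mainly because of inconsistent schema interpretation, mismatch between user intent and model output, and insufficient mechanisms for self-correction when errors occur. This paper proposes the STROT Framework (Structured Task Reasoning and Output Transformation), which uses structured prompting and feedback-driven generation of transformation logic to improve the reliability and semantic alignment of LLM-based analytical workflows. STROT first performs lightweight schema introspection and sample-based field classification to build a dynamic context that captures both the structure and the statistical profile of the input data. This context is embedded in structured prompts that guide the model toward task-specific, interpretable outputs. To handle common failure modes in complex queries, STROT adds a refinement mechanism in which the model iteratively revises its outputs based on execution feedback and validation signals. Unlike conventional approaches that rely on static prompts or single-shot inference, STROT treats the LLM as a reasoning agent embedded in a controlled analysis loop, able to adjust its output trajectory through planning and correction. The result is a robust and reproducible framework for LLM reasoning over structured data, suited to diverse data exploration and analysis tasks that require interpretability, stability, and correctness.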
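
The two mechanisms described above, sample-based field classification and revision driven by execution feedback, can be pictured with a small loop. This is a minimal sketch under stated assumptions, not the STROT implementation: `introspect_schema`, `refine_until_valid`, and `fake_propose` are hypothetical names, and the proposer is a stub that stands in for a model generating transformation logic.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]
Transform = Callable[[List[Row]], Any]


def introspect_schema(rows: List[Row]) -> Dict[str, str]:
    """Classify each field of a row sample as 'numeric' or 'categorical'."""
    schema: Dict[str, str] = {}
    for field in rows[0]:
        values = [r[field] for r in rows]
        numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                      for v in values)
        schema[field] = "numeric" if numeric else "categorical"
    return schema


def refine_until_valid(propose: Callable[[Dict[str, str], Optional[str]], Transform],
                       rows: List[Row],
                       validate: Callable[[Any], bool],
                       max_attempts: int = 3) -> Tuple[Any, int]:
    """Ask for transformation logic, run it, and feed failures back to the proposer."""
    schema = introspect_schema(rows)
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        transform = propose(schema, feedback)
        try:
            result = transform(rows)
        except Exception as exc:  # execution feedback
            feedback = f"execution error: {exc!r}"
            continue
        if validate(result):      # validation signal
            return result, attempt
        feedback = f"validation failed for output {result!r}"
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {feedback}")


def fake_propose(schema: Dict[str, str], feedback: Optional[str]) -> Transform:
    """Stub proposer: first guesses a wrong field name, then corrects itself."""
    field = "price" if feedback is None else "amount"
    return lambda rows: sum(r[field] for r in rows)


if __name__ == "__main__":
    data = [{"region": "north", "amount": 10.0}, {"region": "south", "amount": 5.5}]
    print(introspect_schema(data))
    print(refine_until_valid(fake_propose, data, lambda x: isinstance(x, float)))
```

In the example run, the first proposal fails with a `KeyError`, the error text is passed back, and the second proposal succeeds.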


Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models

Abstract

arXiv:2505.01539v1 Announce Type: new Abstract: Generative large language models as tools in the legal domain have the potential to improve the justice system. However, the reasoning behavior of current generative models is brittle and poorly understood, hence cannot be responsibly applied in the domains of law and evidence. In this paper, we introduce an approach for creating benchmarks that can be used to evaluate the reasoning capabilities of generative language models. These benchmarks are dynamically varied, scalable in their complexity, and have formally unambiguous interpretations. In this study, we illustrate the approach on the basis of witness testimony, focusing on the underlying argument attack structure. We dynamically generate both linear and non-linear argument attack graphs of varying complexity and translate these into reasoning puzzles about witness testimony expressed in natural language. We show that state-of-the-art large language models often fail in these reasoning puzzles, already at low complexity. Obvious mistakes are made by the models, and their inconsistent performance indicates that their reasoning capabilities are brittle. Furthermore, at higher complexity, even state-of-the-art models specifically presented for reasoning capabilities make mistakes. We show the viability of using a parametrized benchmark with varying complexity to evaluate the reasoning capabilities of generative language models. As such, the findings contribute to a better understanding of the limitations of the reasoning capabilities of generative models, which is essential when designing responsible AI systems in the legal domain.

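Summary

Generative large language models, used as tools in the legal domain, have the potential to improve the justice system. However, the reasoning behavior of current generative models is brittle and poorly understood, so they cannot yet be applied responsibly in the domains of law and evidence. This paper proposes an approach for creating benchmarks that evaluate the reasoning capabilities of generative language models. The benchmarks are dynamically varied, scalable in complexity, and formally unambiguous in interpretation. Taking witness testimony as the basis and focusing on the underlying argument attack structure, the study dynamically generates linear and non-linear argument attack graphs of varying complexity and translates them into reasoning puzzles about witness testimony expressed in natural language. Experiments show that state-of-the-art large language models often fail on these puzzles, even at low complexity. The models make obvious mistakes, and their inconsistent performance indicates brittle reasoning capabilities. At higher complexity, even state-of-the-art models presented specifically for their reasoning capabilities make mistakes. The study demonstrates the viability of a parameterized benchmark with adjustable complexity for evaluating the reasoning of generative language models. The findings contribute to a better understanding of the limits of that reasoning, which is essential for designing responsible AI systems in the legal domain.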
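
An argument attack graph has a formally unambiguous reading once a semantics is fixed. The sketch below assumes Dung-style grounded semantics, which the abstract does not name, and computes which arguments survive. The witness labels and the function `grounded_extension` are hypothetical illustrations, not the paper's generator.

```python
from typing import Set, Tuple


def grounded_extension(arguments: Set[str],
                       attacks: Set[Tuple[str, str]]) -> Set[str]:
    """Least fixed point of: accept an argument if every attacker is defeated.

    attacks contains pairs (x, y) meaning "x attacks y".
    """
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    accepted: Set[str] = set()
    while True:
        # Arguments attacked by something already accepted are defeated.
        defeated = {y for (x, y) in attacks if x in accepted}
        # Accept every argument whose attackers are all defeated
        # (unattacked arguments qualify trivially).
        updated = {a for a in arguments if attackers[a] <= defeated}
        if updated == accepted:
            return accepted
        accepted = updated


if __name__ == "__main__":
    # Linear chain: witness w3 discredits w2, and w2 discredits w1.
    chain = grounded_extension({"w1", "w2", "w3"},
                               {("w3", "w2"), ("w2", "w1")})
    print(sorted(chain))   # w3 is unattacked, so w2 falls and w1 is reinstated

    # Non-linear case: two witnesses discredit each other.
    cycle = grounded_extension({"a", "b"}, {("a", "b"), ("b", "a")})
    print(sorted(cycle))   # neither can be accepted under grounded semantics
```

Lengthening the chain or adding cycles changes the puzzle's complexity while the correct answer stays mechanically checkable, which is the property the benchmark relies on.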


Inducing Robustness in a 2 Dimensional Direct Preference Optimization Paradigm

Abstract

arXiv:2505.01706v1 Announce Type: new Abstract: Direct Preference Optimisation (DPO) has emerged as a powerful method for aligning Large Language Models (LLMs) with human preferences, offering a stable and efficient alternative to approaches that use Reinforcement learning via Human Feedback. In this work, we investigate the performance of DPO using open-source preference datasets. One of the major drawbacks of DPO is that it doesn't induce granular scoring and treats all the segments of the responses with equal propensity. However, this is not practically true for human preferences since even "good" responses have segments that may not be preferred by the annotator. To resolve this, a 2-dimensional scoring for DPO alignment called 2D-DPO was proposed. We explore the 2D-DPO alignment paradigm and the advantages it provides over the standard DPO by comparing their win rates. It is observed that these methods, even though effective, are not robust to label/score noise. To counter this, we propose an approach of incorporating segment-level score noise robustness to the 2D-DPO algorithm. Along with theoretical backing, we also provide empirical verification in favour of the algorithm and introduce other noise models that can be present.

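Summary

Direct Preference Optimization (DPO) has become an effective method for aligning large language models (LLMs) with human preferences, offering a stable and efficient alternative to reinforcement learning from human feedback. This study evaluates DPO on open-source preference datasets. A major drawback of the method is that it provides no granular scoring and treats every segment of a response with equal propensity. This does not match human preferences in practice, since even "good" responses may contain segments the annotator does not prefer. To address this, a two-dimensional scoring scheme for DPO alignment, called 2D-DPO, was proposed. We examine the 2D-DPO alignment paradigm and its advantages over standard DPO by comparing win rates. We find that these methods, although effective, are not robust to label or score noise. We therefore propose incorporating segment-level score noise robustness into the 2D-DPO algorithm. Alongside theoretical support, we verify the algorithm empirically and discuss other noise models that may be present.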
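
As background, the standard DPO loss for a single preference pair is the negative log-sigmoid of a scaled difference of log-probability ratios between the policy and a frozen reference model. The sketch below computes that standard loss only; it does not include the segment-level scores of 2D-DPO or the noise-robust variant the paper proposes. The function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy (logp_*) and
    under the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy raises the chosen response and lowers the rejected one: loss drops.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Because the loss depends on one scalar per response, every segment of a response contributes with the same weight, which is the limitation that segment-level scoring is meant to address.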


Unraveling Media Perspectives: A Comprehensive Methodology Combining Large Language Models, Topic Modeling, Sentiment Analysis, and Ontology Learning to Analyse Media Bias

Abstract

arXiv:2505.01754v1 Announce Type: new Abstract: Biased news reporting poses a significant threat to informed decision-making and the functioning of democracies. This study introduces a novel methodology for scalable, minimally biased analysis of media bias in political news. The proposed approach examines event selection, labeling, word choice, and commission and omission biases across news sources by leveraging natural language processing techniques, including hierarchical topic modeling, sentiment analysis, and ontology learning with large language models. Through three case studies related to current political events, we demonstrate the methodology's effectiveness in identifying biases across news sources at various levels of granularity. This work represents a significant step towards scalable, minimally biased media bias analysis, laying the groundwork for tools to help news consumers navigate an increasingly complex media landscape.

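Summary

Biased news reporting poses a significant threat to informed decision-making and the functioning of democracies. This study proposes a new methodology for scalable, minimally biased analysis of media bias in political news. Using natural language processing techniques, including hierarchical topic modeling, sentiment analysis, and ontology learning with large language models, the approach examines bias across news sources in event selection, labeling, word choice, and commission and omission. Through three case studies of current political events, we show that the methodology effectively identifies bias across news sources at various levels of granularity. The work is a significant step toward scalable, minimally biased media bias analysis and lays the groundwork for tools that help readers navigate an increasingly complex media landscape.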


Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey

Abstract

arXiv:2505.01821v1 Announce Type: new Abstract: Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications, integrating cloud resources with edge devices to enable efficient, low-latency processing. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems, yet introduce significant challenges in model deployment and resource management. In this survey, we comprehensively examine the intersection of distributed intelligence and model optimization within edge-cloud environments, providing a structured tutorial on fundamental architectures, enabling technologies, and emerging applications. Additionally, we systematically analyze model optimization approaches, including compression, adaptation, and neural architecture search, alongside AI-driven resource management strategies that balance performance, energy efficiency, and latency requirements. We further explore critical aspects of privacy protection and security enhancement within ECCC systems and examine practical deployments through diverse applications, spanning autonomous driving, healthcare, and industrial automation. Performance analysis and benchmarking techniques are also thoroughly explored to establish evaluation standards for these complex systems. Furthermore, the review identifies critical research directions including LLM deployment, 6G integration, neuromorphic computing, and quantum computing, offering a roadmap for addressing persistent challenges in heterogeneity management, real-time processing, and scalability. By bridging theoretical advancements and practical deployments, this survey offers researchers and practitioners a holistic perspective on leveraging AI to optimize distributed computing environments, fostering innovation in next-generation intelligent systems.

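Summary

Edge-cloud collaborative computing (ECCC) has emerged as a key paradigm for meeting the computational demands of modern intelligent applications, integrating cloud resources with edge devices for efficient, low-latency processing. Recent advances in AI, especially deep learning and large language models (LLMs), have greatly enhanced these distributed systems but have also introduced major challenges in model deployment and resource management. This survey systematically examines the intersection of distributed intelligence and model optimization in edge-cloud environments and provides a structured tutorial on fundamental architectures, enabling technologies, and emerging applications. We analyze model optimization approaches in detail, including compression, adaptation, and neural architecture search, together with AI-driven resource management strategies that balance performance, energy efficiency, and latency. We further discuss privacy protection and security enhancement in ECCC systems and examine practical deployments in applications such as autonomous driving, healthcare, and industrial automation. Performance analysis and benchmarking techniques are also studied in depth to establish evaluation standards for these complex systems. In addition, the survey identifies key research directions, including LLM deployment, 6G integration, neuromorphic computing, and quantum computing, and offers a roadmap for persistent challenges in heterogeneity management, real-time processing, and scalability. By connecting theoretical advances with practical deployments, it gives researchers and practitioners a holistic view of using AI to optimize distributed computing environments and fosters innovation in next-generation intelligent systems.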


Generative AI in clinical practice: novel qualitative evidence of risk and responsible use of Google's NotebookLM

Abstract

arXiv:2505.01955v1 Announce Type: new Abstract: The advent of generative artificial intelligence, especially large language models (LLMs), presents opportunities for innovation in research, clinical practice, and education. Recently, Dihan et al. lauded LLM tool NotebookLM's potential, including for generating AI-voiced podcasts to educate patients about treatment and rehabilitation, and for quickly synthesizing medical literature for professionals. We argue that NotebookLM presently poses clinical and technological risks that should be tested and considered prior to its implementation in clinical practice.

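Summary

The advent of generative artificial intelligence, especially large language models (LLMs), brings opportunities for innovation in research, clinical practice, and education. Dihan et al. recently praised the potential of the LLM tool NotebookLM, including generating AI-voiced podcasts to educate patients about treatment and rehabilitation and quickly synthesizing medical literature for professionals. We argue that NotebookLM currently poses clinical and technological risks that should be thoroughly tested and assessed before it is used in clinical practice.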


From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent

Abstract

arXiv:2505.02024v1 Announce Type: new Abstract: Manus AI is a general-purpose AI agent introduced in early 2025, marking a significant advancement in autonomous artificial intelligence. Developed by the Chinese startup Monica.im, Manus is designed to bridge the gap between "mind" and "hand" - combining the reasoning and planning capabilities of large language models with the ability to execute complex, end-to-end tasks that produce tangible outcomes. This paper presents a comprehensive overview of Manus AI, exploring its core technical architecture, diverse applications across sectors such as healthcare, finance, manufacturing, robotics, and gaming, as well as its key strengths, current limitations, and future potential. Positioned as a preview of what lies ahead, Manus AI represents a shift toward intelligent agents that can translate high-level intentions into real-world actions, heralding a new era of human-AI collaboration.

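Summary

Manus AI is a general-purpose AI agent introduced in early 2025 that marks a significant advance in autonomous artificial intelligence. Developed by the Chinese startup Monica.im, it is designed to bridge the gap between "mind" and "hand" by combining the reasoning and planning capabilities of large language models with the ability to execute complex end-to-end tasks that produce tangible outcomes. This paper gives a comprehensive account of Manus AI's core technical architecture, explores its applications in sectors such as healthcare, finance, manufacturing, robotics, and gaming, and analyzes its key strengths, current limitations, and future potential. As a preview of what lies ahead, Manus AI signals a shift toward intelligent agents that translate high-level intentions into real-world actions, opening a new era of human-AI collaboration.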


TxP: Reciprocal Generation of Ground Pressure Dynamics and Activity Descriptions for Improving Human Activity Recognition

Abstract

arXiv:2505.02052v1 Announce Type: new Abstract: Sensor-based human activity recognition (HAR) has predominantly focused on Inertial Measurement Units and vision data, often overlooking the capabilities unique to pressure sensors, which capture subtle body dynamics and shifts in the center of mass. Despite their potential for postural and balance-based activities, pressure sensors remain underutilized in the HAR domain due to limited datasets. To bridge this gap, we propose to exploit generative foundation models with pressure-specific HAR techniques. Specifically, we present a bidirectional Text×Pressure model that uses generative foundation models to interpret pressure data as natural language. TxP accomplishes two tasks: (1) Text2Pressure, converting activity text descriptions into pressure sequences, and (2) Pressure2Text, generating activity descriptions and classifications from dynamic pressure maps. Leveraging pre-trained models like CLIP and LLaMA 2 13B Chat, TxP is trained on our synthetic PressLang dataset, containing over 81,100 text-pressure pairs. Validated on real-world data for activities such as yoga and daily tasks, TxP provides novel approaches to data augmentation and classification grounded in atomic actions. This consequently improved HAR performance by up to 12.4% in macro F1 score compared to the state-of-the-art, advancing pressure-based HAR with broader applications and deeper insights into human movement.

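Summary

Sensor-based human activity recognition (HAR) has focused mainly on inertial measurement units and vision data, often overlooking the unique capability of pressure sensors to capture subtle body dynamics and shifts in the center of mass. Although pressure sensors are promising for postural and balance-based activities, they remain underused in HAR because datasets are scarce. To close this gap, we propose combining generative foundation models with pressure-specific HAR techniques. Specifically, we develop a bidirectional Text×Pressure (TxP) model that uses generative foundation models to interpret pressure data as natural language. TxP performs two tasks: (1) Text2Pressure converts activity text descriptions into pressure sequences, and (2) Pressure2Text generates activity descriptions and classifications from dynamic pressure maps. Built on pre-trained models such as CLIP and LLaMA 2 13B Chat, TxP is trained on our synthetic PressLang dataset, which contains over 81,100 text-pressure pairs. Validation on real-world data for activities such as yoga and daily tasks shows that TxP, through data augmentation and classification grounded in atomic actions, improves the macro F1 score of HAR by up to 12.4% over the state of the art, giving pressure-based HAR broader applications and deeper insight into human movement.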


Leveraging LLM Agents and Digital Twins for Fault Handling in Process Plants

Abstract

arXiv:2505.02076v1 Announce Type: new Abstract: Advances in Automation and Artificial Intelligence continue to enhance the autonomy of process plants in handling various operational scenarios. However, certain tasks, such as fault handling, remain challenging, as they rely heavily on human expertise. This highlights the need for systematic, knowledge-based methods. To address this gap, we propose a methodological framework that integrates Large Language Model (LLM) agents with a Digital Twin environment. The LLM agents continuously interpret system states and initiate control actions, including responses to unexpected faults, with the goal of returning the system to normal operation. In this context, the Digital Twin acts both as a structured repository of plant-specific engineering knowledge for agent prompting and as a simulation platform for the systematic validation and verification of the generated corrective control actions. The evaluation using a mixing module of a process plant demonstrates that the proposed framework is capable not only of autonomously controlling the mixing module, but also of generating effective corrective actions to mitigate a pipe clogging with only a few reprompts.

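Summary

Advances in automation and artificial intelligence continue to increase the autonomy of process plants in handling operational scenarios. Certain tasks such as fault handling remain challenging, however, because they depend heavily on human expertise, which highlights the need for systematic, knowledge-based methods. To fill this gap, this study proposes a methodological framework that integrates Large Language Model (LLM) agents with a Digital Twin environment. The LLM agents continuously interpret system states and initiate control actions, including responses to unexpected faults, with the aim of returning the system to normal operation. In this framework, the Digital Twin serves both as a structured repository of plant-specific engineering knowledge for agent prompting and as a simulation platform for systematic validation and verification of the generated corrective control actions. Tests on the mixing module of a process plant show that the framework can not only control the mixing module autonomously but also generate effective corrective actions to mitigate pipe clogging with only a few reprompts.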


Retrieval-augmented in-context learning for multimodal large language models in disease classification

Abstract

arXiv:2505.02087v1 Announce Type: new Abstract: Objectives: We aim to dynamically retrieve informative demonstrations, enhancing in-context learning in multimodal large language models (MLLMs) for disease classification. Methods: We propose a Retrieval-Augmented In-Context Learning (RAICL) framework, which integrates retrieval-augmented generation (RAG) and in-context learning (ICL) to adaptively select demonstrations with similar disease patterns, enabling more effective ICL in MLLMs. Specifically, RAICL examines embeddings from diverse encoders, including ResNet, BERT, BioBERT, and ClinicalBERT, to retrieve appropriate demonstrations, and constructs conversational prompts optimized for ICL. We evaluated the framework on two real-world multi-modal datasets (TCGA and IU Chest X-ray), assessing its performance across multiple MLLMs (Qwen, Llava, Gemma), embedding strategies, similarity metrics, and varying numbers of demonstrations. Results: RAICL consistently improved classification performance. Accuracy increased from 0.7854 to 0.8368 on TCGA and from 0.7924 to 0.8658 on IU Chest X-ray. Multi-modal inputs outperformed single-modal ones, with text-only inputs being stronger than images alone. The richness of information embedded in each modality will determine which embedding model can be used to get better results. Few-shot experiments showed that increasing the number of retrieved examples further enhanced performance. Across different similarity metrics, Euclidean distance achieved the highest accuracy while cosine similarity yielded better macro-F1 scores. RAICL demonstrated consistent improvements across various MLLMs, confirming its robustness and versatility. Conclusions: RAICL provides an efficient and scalable approach to enhance in-context learning in MLLMs for multimodal disease classification.

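Summary

Objectives: We aim to dynamically retrieve informative demonstrations to enhance in-context learning in multimodal large language models (MLLMs) for disease classification.

Methods: We propose a Retrieval-Augmented In-Context Learning (RAICL) framework that combines retrieval-augmented generation (RAG) and in-context learning (ICL) to adaptively select demonstrations with similar disease patterns, enabling more effective ICL in MLLMs. Specifically, RAICL analyzes embeddings from different encoders, including ResNet, BERT, BioBERT, and ClinicalBERT, to retrieve suitable demonstrations, and builds conversational prompts optimized for ICL. We evaluate the framework on two real-world multimodal datasets (TCGA and IU Chest X-ray) across multiple MLLMs (Qwen, Llava, Gemma), embedding strategies, similarity metrics, and numbers of demonstrations.

Results: RAICL consistently improved classification performance. Accuracy rose from 0.7854 to 0.8368 on TCGA and from 0.7924 to 0.8658 on IU Chest X-ray. Multimodal inputs outperformed single-modal ones, and text-only inputs were stronger than image-only inputs. The richness of the information embedded in each modality determines which embedding model gives better results. Few-shot experiments showed that retrieving more examples further improved performance. Among similarity metrics, Euclidean distance achieved the highest accuracy, while cosine similarity gave better macro-F1 scores. RAICL improved results consistently across MLLMs, confirming its robustness and versatility.

Conclusions: RAICL offers an efficient and scalable way to enhance in-context learning in MLLMs for multimodal disease classification.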
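
The results above distinguish Euclidean distance from cosine similarity as retrieval metrics. The sketch below shows how the two can select different demonstrations from the same pool: Euclidean distance is sensitive to vector length, while cosine similarity compares direction only. The tiny two-dimensional "embeddings", the labels, and the function `retrieve` are hypothetical; real embeddings would come from an encoder such as those listed in the Methods.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve(query: Sequence[float],
             pool: List[Tuple[str, Sequence[float]]],
             k: int,
             metric: str = "euclidean") -> List[str]:
    """Return the labels of the k demonstrations closest to the query."""
    if metric == "euclidean":
        ranked = sorted(pool, key=lambda item: euclidean(query, item[1]))
    elif metric == "cosine":
        ranked = sorted(pool, key=lambda item: cosine_similarity(query, item[1]),
                        reverse=True)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return [label for label, _ in ranked[:k]]


if __name__ == "__main__":
    pool = [("case_A", (3.0, 0.0)), ("case_B", (1.0, 1.0)), ("case_C", (-1.0, 0.0))]
    query = (1.0, 0.0)
    print(retrieve(query, pool, k=1, metric="euclidean"))  # nearest point: case_B
    print(retrieve(query, pool, k=1, metric="cosine"))     # same direction: case_A
```

The retrieved labels would then index the demonstrations placed in the conversational prompt; that prompt construction is not sketched here.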


MemEngine: A Unified and Modular Library for Developing Advanced Memory of LLM-based Agents

Abstract

arXiv:2505.02099v1 Announce Type: new Abstract: Recently, large language model based (LLM-based) agents have been widely applied across various fields. As a critical part, their memory capabilities have captured significant interest from both industrial and academic communities. Although many advanced memory models have been proposed in recent research, there remains a lack of unified implementations under a general framework. To address this issue, we develop a unified and modular library for developing advanced memory models of LLM-based agents, called MemEngine. Based on our framework, we implement abundant memory models from recent research works. Additionally, our library facilitates convenient and extensible memory development, and offers user-friendly and pluggable memory usage. To benefit the community, we have made our project publicly available at https://github.com/nuster1128/MemEngine.

摘要

近年来,基于大语言模型(LLM)的智能体已广泛应用于各个领域。作为关键组成部分,其记忆能力引起了工业界和学术界的广泛关注。尽管近期研究提出了许多先进的记忆模型,但在通用框架下仍缺乏统一的实现方案。为解决这一问题,我们开发了一个模块化的统一库MemEngine,用于构建基于LLM智能体的高级记忆模型。基于该框架,我们实现了近期研究中的多种记忆模型。此外,本库支持便捷可扩展的记忆功能开发,并提供用户友好、即插即用的记忆调用方式。为促进社区发展,我们已将项目开源发布于https://github.com/nuster1128/MemEngine。


Attention Mechanisms Perspective: Exploring LLM Processing of Graph-Structured Data

Abstract

arXiv:2505.02130v1 Announce Type: new Abstract: Attention mechanisms are critical to the success of large language models (LLMs), driving significant advancements in multiple fields. However, for graph-structured data, which requires emphasis on topological connections, they fall short compared to message-passing mechanisms on fixed links, such as those employed by Graph Neural Networks (GNNs). This raises a question: "Does attention fail for graphs in natural language settings?" Motivated by these observations, we embarked on an empirical study from the perspective of attention mechanisms to explore how LLMs process graph-structured data. The goal is to gain deeper insights into the attention behavior of LLMs over graph structures. We uncovered unique phenomena regarding how LLMs apply attention to graph-structured data and analyzed these findings to improve the modeling of such data by LLMs. The primary findings of our research are: 1) While LLMs can recognize graph data and capture text-node interactions, they struggle to model inter-node relationships within graph structures due to inherent architectural constraints. 2) The attention distribution of LLMs across graph nodes does not align with ideal structural patterns, indicating a failure to adapt to graph topology nuances. 3) Neither fully connected attention nor fixed connectivity is optimal; each has specific limitations in its application scenarios. Instead, intermediate-state attention windows improve LLM training performance and seamlessly transition to fully connected windows during inference. Source code: https://github.com/millioniron/LLM_exploration (LLM4Exploration)

摘要

注意力机制对大型语言模型(LLMs)的成功至关重要,推动了多个领域的重大进展。然而,对于需要强调拓扑连接关系的图结构数据,其表现逊色于基于固定链接的消息传递机制(如图神经网络GNNs所采用的)。这引发了一个问题:'在自然语言场景下,注意力机制是否无法有效处理图数据?'基于这些观察,我们从注意力机制的角度展开实证研究,探索LLMs如何处理图结构数据,旨在深入理解LLMs在图结构上的注意力行为特征。我们发现了LLMs对图结构数据施加注意力的独特现象,并通过分析这些发现来改进LLMs对此类数据的建模能力。主要研究成果包括:1)LLMs虽能识别图数据并捕捉文本-节点交互,但由于固有架构限制,难以建模图结构中的节点间关系;2)LLMs在图节点间的注意力分布与理想结构模式不符,表明其未能适应图拓扑的细微特征;3)全连接注意力和固定连接均非最优方案,各自在应用场景中存在特定局限。而中间态注意力窗口能提升LLMs训练性能,并在推理时无缝过渡至全连接窗口。
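
The abstract does not define "intermediate-state attention windows". One possible reading, offered only as an assumption, is an attention mask that sits between fixed connectivity (each node attends to its direct neighbours) and full connectivity (each node attends to every node). The sketch below builds such a mask from k-hop reachability; the function name `k_hop_mask` and the toy path graph are hypothetical and not taken from the paper or its repository.

```python
def k_hop_mask(adjacency, k):
    """Boolean mask where mask[i][j] is True if node j is within k hops of node i.

    k = 0 lets each node attend only to itself, k = 1 corresponds to
    fixed (neighbour-only) connectivity, and a large enough k gives a
    fully connected window on a connected graph.
    """
    n = len(adjacency)
    reach = [[i == j for j in range(n)] for i in range(n)]
    for _ in range(k):
        expanded = [row[:] for row in reach]
        for i in range(n):
            for j in range(n):
                if reach[i][j]:
                    # Extend the window by one hop from every reachable node.
                    for m in range(n):
                        if adjacency[j][m]:
                            expanded[i][m] = True
        reach = expanded
    return reach

# Path graph 0 - 1 - 2 - 3 as a toy example.
path_graph = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
```

Under this reading, training with a moderate k and switching to the all-True mask at inference would mirror the transition the abstract describes, although the authors' actual windowing scheme may differ.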


Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes

Abstract

arXiv:2505.02184v1 Announce Type: new Abstract: While large language models (LLMs) are increasingly used for generating parallel scientific code, most current efforts emphasize functional correctness, often overlooking performance and energy considerations. In this work, we propose LASSI-EE, an automated LLM-based refactoring framework that generates energy-efficient parallel code on a target parallel system for a given parallel code as input. Through a multi-stage, iterative pipeline process, LASSI-EE achieved an average energy reduction of 47% across 85% of the 20 HeCBench benchmarks tested on NVIDIA A100 GPUs. Our findings demonstrate the broader potential of LLMs, not only for generating correct code but also for enabling energy-aware programming. We also address key insights and limitations within the framework, offering valuable guidance for future improvements.

摘要

虽然大语言模型(LLMs)越来越多地用于生成并行科学代码,但当前大多数工作仅关注功能正确性,往往忽视了性能和能耗考量。本研究提出LASSI-EE,这是一个基于LLM的自动化代码重构框架,能够针对给定的并行代码输入,在目标并行系统上生成高能效的并行代码。通过多阶段迭代式流程,LASSI-EE在NVIDIA A100 GPU上测试的20个HeCBench基准程序中,对85%的案例实现了平均47%的能耗降低。我们的研究结果表明,LLMs不仅具备生成正确代码的能力,更在能源感知编程方面展现出广阔潜力。同时,我们针对该框架提出了关键见解与局限性分析,为未来改进提供了有价值的指导。


LLM-Guided Probabilistic Program Induction for POMDP Model Estimation

Abstract

arXiv:2505.02216v1 Announce Type: new Abstract: Partially Observable Markov Decision Processes (POMDPs) model decision making under uncertainty. While there are many approaches to approximately solving POMDPs, we aim to address the problem of learning such models. In particular, we are interested in a subclass of POMDPs wherein the components of the model, including the observation function, reward function, transition function, and initial state distribution function, can be modeled as low-complexity probabilistic graphical models in the form of a short probabilistic program. Our strategy to learn these programs uses an LLM as a prior, generating candidate probabilistic programs that are then tested against the empirical distribution and adjusted through feedback. We experiment on a number of classical toy POMDP problems, simulated MiniGrid domains, and two real mobile-base robotics search domains involving partial observability. Our results show that using an LLM to guide the construction of a low-complexity POMDP model can be more effective than tabular POMDP learning, behavior cloning, or direct LLM planning.

摘要

部分可观测马尔可夫决策过程(POMDPs)用于建模不确定性下的决策问题。尽管已有多种近似求解POMDPs的方法,本研究致力于解决此类模型的学习问题。我们特别关注一类POMDPs子集,其模型组件(包括观测函数、奖励函数、转移函数和初始状态分布函数)均可表示为短概率程序形式的低复杂度概率图模型。我们的学习策略采用大型语言模型(LLM)作为先验,生成候选概率程序后通过经验分布测试并基于反馈进行调整。实验涵盖经典玩具POMDP问题、模拟MiniGrid领域以及两个涉及部分可观测性的真实移动基座机器人搜索场景。结果表明:利用LLM指导构建低复杂度POMDP模型的方法,相较于表格型POMDP学习、行为克隆或直接LLM规划更具实效性。
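
The "generate candidates, then test against the empirical distribution" loop can be sketched for a single model component, the observation function. This is a minimal illustration under stated assumptions: the two candidate programs are hand-written stand-ins for LLM proposals, the logged (state, observation) pairs are made up, and log-likelihood is used as the scoring rule, which the abstract does not specify.

```python
import math

# Two candidate observation models for a binary-state problem. Each is a short
# "probabilistic program" returning P(observation | state) as a dictionary.
def noisy_sensor(state):
    # The sensor reports the true state 85% of the time (assumed value).
    return {state: 0.85, 1 - state: 0.15}

def uninformative_sensor(state):
    # The observation carries no information about the state.
    return {0: 0.5, 1: 0.5}

def log_likelihood(program, data):
    """Sum of log P(obs | state) over logged (state, obs) pairs."""
    total = 0.0
    for state, obs in data:
        p = program(state).get(obs, 0.0)
        total += math.log(max(p, 1e-12))  # floor avoids log(0)
    return total

def select_best(candidates, data):
    """Return the name of the candidate that best explains the data."""
    return max(candidates, key=lambda name: log_likelihood(candidates[name], data))

# Hypothetical experience: the observation matches the state in 6 of 8 cases.
data = [(0, 0), (0, 0), (1, 1), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]
candidates = {"noisy_sensor": noisy_sensor, "uninformative_sensor": uninformative_sensor}
```

In a fuller loop, the score and the worst-explained samples would be fed back to the LLM so it can propose a revised program, which is the feedback step the abstract mentions.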


Real-time Spatial Retrieval Augmented Generation for Urban Environments

Abstract

arXiv:2505.02271v1 Announce Type: new Abstract: The proliferation of Generative Artificial Intelligence (AI), especially Large Language Models, presents transformative opportunities for urban applications through Urban Foundation Models. However, base models face limitations, as they only contain the knowledge available at the time of training, and updating them is both time-consuming and costly. Retrieval Augmented Generation (RAG) has emerged in the literature as the preferred approach for injecting contextual information into Foundation Models. It prevails over techniques such as fine-tuning, which are less effective in dynamic, real-time scenarios like those found in urban environments. However, traditional RAG architectures, based on semantic databases, knowledge graphs, structured data, or AI-powered web searches, do not fully meet the demands of urban contexts. Urban environments are complex systems characterized by large volumes of interconnected data, frequent updates, real-time processing requirements, security needs, and strong links to the physical world. This work proposes a real-time spatial RAG architecture that defines the necessary components for the effective integration of generative AI into cities, leveraging temporal and spatial filtering capabilities through linked data. The proposed architecture is implemented using FIWARE, an ecosystem of software components to develop smart city solutions and digital twins. The design and implementation are demonstrated through the use case of a tourism assistant in the city of Madrid. The use case serves to validate the correct integration of Foundation Models through the proposed RAG architecture.

摘要

生成式人工智能(AI),尤其是大语言模型的激增,通过城市基础模型为城市应用带来了变革性机遇。然而,基础模型存在局限性,因为它们仅包含训练时可用的知识,且更新过程耗时且成本高昂。检索增强生成(RAG)在文献中已成为向基础模型注入上下文信息的首选方法。相较于微调等技术,RAG在动态、实时的城市环境场景中表现更优。然而,传统的基于语义数据库、知识图谱、结构化数据或AI驱动的网络搜索的RAG架构,并不能完全满足城市环境的需求。城市环境是复杂的系统,具有海量互联数据、频繁更新、实时处理需求、安全性要求以及与物理世界紧密联系等特点。本研究提出了一种实时空间RAG架构,通过关联数据的时空过滤能力,定义了将生成式AI有效集成到城市中所必需的组件。该架构采用FIWARE(一个用于开发智慧城市解决方案和数字孪生的软件组件生态系统)实现,并以马德里市的旅游助手用例展示了设计与实施过程。该用例验证了通过所提出的RAG架构实现基础模型正确集成的有效性。
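
The temporal and spatial filtering that the abstract places in front of generation can be sketched as a plain pre-filter: keep only records that are both near the user and recent, and pass those to the model as context. This sketch does not use any FIWARE API; the record fields, coordinates, radius, and time window are all hypothetical values chosen for illustration.

```python
import math
from datetime import datetime, timedelta

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def filter_entities(entities, lat, lon, now, radius_km, max_age):
    """Keep entities within radius_km of (lat, lon) and observed within max_age of now."""
    kept = []
    for entity in entities:
        near = haversine_km(lat, lon, entity["lat"], entity["lon"]) <= radius_km
        fresh = now - entity["observed_at"] <= max_age
        if near and fresh:
            kept.append(entity)
    return kept

# Made-up records for a user standing in central Madrid.
now = datetime(2025, 5, 6, 12, 0)
entities = [
    {"name": "museum_queue", "lat": 40.4138, "lon": -3.6921,
     "observed_at": datetime(2025, 5, 6, 11, 50)},   # near and recent
    {"name": "stale_report", "lat": 40.4150, "lon": -3.7000,
     "observed_at": datetime(2025, 5, 5, 9, 0)},     # near but a day old
    {"name": "far_event", "lat": 41.3874, "lon": 2.1686,
     "observed_at": datetime(2025, 5, 6, 11, 55)},   # recent but in another city
]
context = filter_entities(entities, 40.4168, -3.7038, now,
                          radius_km=2.0, max_age=timedelta(minutes=30))
```

Only the filtered records would then be serialized into the prompt, which keeps the context small and current without retraining the underlying model.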


A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP)

Abstract

arXiv:2505.02279v1 Announce Type: new Abstract: Large language model (LLM)-powered autonomous agents demand robust, standardized protocols to integrate tools, share contextual data, and coordinate tasks across heterogeneous systems. Ad-hoc integrations are difficult to scale, secure, and generalize across domains. This survey examines four emerging agent communication protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP), each addressing interoperability in distinct deployment contexts. MCP provides a JSON-RPC client-server interface for secure tool invocation and typed data exchange. ACP introduces REST-native messaging via multi-part messages and asynchronous streaming to support multimodal agent responses. A2A enables peer-to-peer task outsourcing through capability-based Agent Cards, facilitating enterprise-scale workflows. ANP supports open-network agent discovery and secure collaboration using decentralized identifiers (DIDs) and JSON-LD graphs. The protocols are compared across multiple dimensions, including interaction modes, discovery mechanisms, communication patterns, and security models. Based on the comparative analysis, a phased adoption roadmap is proposed: beginning with MCP for tool access, followed by ACP for multimodal messaging, A2A for collaborative task execution, and extending to ANP for decentralized agent marketplaces. This work provides a comprehensive foundation for designing secure, interoperable, and scalable ecosystems of LLM-powered agents.

摘要

基于大语言模型(LLM)的自主代理需要强大、标准化的协议来集成工具、共享上下文数据并在异构系统间协调任务。临时集成方案难以实现跨领域的规模化扩展、安全保障和泛化应用。本研究考察了四种新兴的智能体通信协议:模型上下文协议(MCP)、代理通信协议(ACP)、代理间协议(A2A)和代理网络协议(ANP),每种协议针对不同部署场景的互操作性需求。MCP通过JSON-RPC客户端-服务器接口实现安全的工具调用和类型化数据交换;ACP采用多部分消息和异步流传输的REST原生消息机制,支持多模态代理响应;A2A通过基于能力的代理卡片实现点对点任务外包,促进企业级工作流协作;ANP利用去中心化标识符(DIDs)和JSON-LD图谱支持开放网络的代理发现与安全协作。研究从交互模式、发现机制、通信范式和安全性模型等多个维度对协议进行比较,并据此提出分阶段采用路线图:从工具接入的MCP开始,逐步扩展到多模态消息传递的ACP、协作任务执行的A2A,最终延伸至去中心化代理市场的ANP。本工作为构建安全、可互操作且可扩展的LLM驱动代理生态系统奠定了系统化基础。
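
To make the "JSON-RPC client-server interface" attributed to MCP concrete, the sketch below builds and parses JSON-RPC 2.0 messages with the standard library only. The envelope fields (`jsonrpc`, `id`, `method`, `params`, `result`, `error`) come from JSON-RPC 2.0. The `tools/call` method name reflects my understanding of the MCP specification and should be checked against it; the tool name `lookup_poi` and its arguments are hypothetical.

```python
import json

def make_request(request_id, method, params):
    """Serialize a JSON-RPC 2.0 request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    })

def parse_response(raw):
    """Return the result of a JSON-RPC 2.0 response, or raise on an error reply."""
    message = json.loads(raw)
    if message.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    if "error" in message:
        error = message["error"]
        raise RuntimeError(f"remote error {error['code']}: {error['message']}")
    return message["result"]

# Hypothetical tool invocation request.
request = make_request(1, "tools/call",
                       {"name": "lookup_poi", "arguments": {"city": "Madrid"}})
```

The other three protocols in the survey use different transports and discovery mechanisms (REST messaging, Agent Cards, DIDs with JSON-LD), so this envelope applies to the MCP case only.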


HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking

Abstract

arXiv:2505.02322v1 Announce Type: new Abstract: Recent advancements have significantly enhanced the performance of large language models (LLMs) in tackling complex reasoning tasks, achieving notable success in domains like mathematical and logical reasoning. However, these methods encounter challenges with complex planning tasks, primarily due to extended reasoning steps, diverse constraints, and the challenge of handling multiple distinct sub-tasks. To address these challenges, we propose HyperTree Planning (HTP), a novel reasoning paradigm that constructs hypertree-structured planning outlines for effective planning. The hypertree structure enables LLMs to engage in hierarchical thinking by flexibly employing the divide-and-conquer strategy, effectively breaking down intricate reasoning steps, accommodating diverse constraints, and managing multiple distinct sub-tasks in a well-organized manner. We further introduce an autonomous planning framework that completes the planning process by iteratively refining and expanding the hypertree-structured planning outlines. Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, resulting in a 3.6 times performance improvement over o1-preview.

摘要

近期研究显著提升了大型语言模型(LLMs)处理复杂推理任务的能力,在数学与逻辑推理等领域取得了显著成果。然而,这些方法在应对复杂规划任务时仍面临挑战,主要源于推理步骤冗长、约束条件多样以及需同时处理多个独立子任务。为此,我们提出超树规划(HTP)——一种通过构建超树结构规划纲要来实现高效推理的新范式。该结构使LLMs能灵活运用分治策略进行层次化思考,有效分解复杂推理步骤、协调多样约束条件,并以体系化方式管理多个独立子任务。我们进一步提出自主规划框架,通过迭代优化与扩展超树结构规划纲要来完成规划过程。实验证明HTP在TravelPlanner基准测试中采用Gemini-1.5-Pro实现了最先进精度,性能较o1-preview提升3.6倍。


Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques

Abstract

arXiv:2505.02351v1 Announce Type: new Abstract: In the field of deep learning, traditional attention mechanisms face significant challenges related to high computational complexity and large memory consumption when processing long sequence data. To address these limitations, we propose Opt-GPTQ, an optimized Gradient-based Post Training Quantization (GPTQ) combining the Grouped Query Attention (GQA) mechanism with paging memory management, optimizing the traditional Multi-Head Attention (MHA) mechanism by grouping query heads and sharing key-value vectors. Optimized GQA (Opt-GQA) effectively reduces computational complexity, minimizes memory fragmentation, and enhances memory utilization for large-scale models. Opt-GPTQ is optimized for Data Center Units (DCUs) and integrated into the vLLM model to maximize hardware efficiency. It customizes GPU kernels to further enhance attention computation by reducing memory access latency and boosting parallel computing capabilities. Opt-GQA integrates Attention with Linear Biases (ALiBi) to reduce overhead and enhance long-sequence processing. Experimental results show that Opt-GPTQ significantly reduces computation time and memory usage while improving model performance.

摘要

在深度学习领域,传统注意力机制在处理长序列数据时面临计算复杂度高和内存消耗大的显著挑战。为突破这些限制,我们提出Opt-GPTQ——一种融合分组查询注意力(GQA)机制与分页内存管理的优化梯度后训练量化方法。该方法通过分组查询头并共享键值向量,优化了传统多头注意力(MHA)机制。优化后的GQA(Opt-GQA)有效降低了计算复杂度,减少内存碎片,并提升大规模模型的显存利用率。Opt-GPTQ针对数据中心计算单元(DCUs)进行专项优化,集成至vLLM模型以实现硬件效率最大化,通过定制GPU核函数进一步减少内存访问延迟并提升并行计算能力来增强注意力计算。Opt-GQA集成线性偏置注意力(ALiBi)以降低开销并强化长序列处理能力。实验结果表明,Opt-GPTQ在提升模型性能的同时,显著减少了计算时间和内存占用。
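
Two ideas named in the abstract have simple arithmetic behind them: grouped-query attention (several query heads share one key-value head, which shrinks the key-value cache) and ALiBi (a per-head linear penalty on attention scores that grows with token distance). The sketch below shows only that arithmetic. The head counts and sizes are illustrative assumptions, the slope formula is the geometric sequence commonly used for power-of-two head counts, and nothing here reflects the Opt-GPTQ kernels themselves.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the shared key-value head used by a given query head under GQA."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

def kv_cache_bytes(num_kv_heads, head_dim, seq_len, num_layers, bytes_per_value=2):
    """Key-value cache size; the leading 2 accounts for storing keys and values."""
    return 2 * num_kv_heads * head_dim * seq_len * num_layers * bytes_per_value

def alibi_slopes(num_heads):
    """Per-head ALiBi slopes 2^(-8*(i+1)/n), valid for power-of-two head counts."""
    return [2 ** (-8 * (i + 1) / num_heads) for i in range(num_heads)]

def alibi_bias(slope, query_pos, key_pos):
    """Bias added to the attention score: more negative for more distant keys."""
    return -slope * (query_pos - key_pos)

# Illustrative configuration: 32 query heads sharing 8 key-value heads.
mha_cache = kv_cache_bytes(num_kv_heads=32, head_dim=128, seq_len=4096, num_layers=32)
gqa_cache = kv_cache_bytes(num_kv_heads=8, head_dim=128, seq_len=4096, num_layers=32)
```

With these assumed sizes the cache shrinks by the ratio of query heads to key-value heads, a factor of four, which is the memory saving GQA is generally used for.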


Task-Oriented Semantic Communication in Large Multimodal Models-based Vehicle Networks

Abstract

arXiv:2505.02413v1 Announce Type: new Abstract: Task-oriented semantic communication has emerged as a fundamental approach for enhancing performance in various communication scenarios. While recent advances in Generative Artificial Intelligence (GenAI), such as Large Language Models (LLMs), have been applied to semantic communication designs, the potential of Large Multimodal Models (LMMs) remains largely unexplored. In this paper, we investigate an LMM-based vehicle AI assistant using a Large Language and Vision Assistant (LLaVA) and propose a task-oriented semantic communication framework to facilitate efficient interaction between users and cloud servers. To reduce computational demands and shorten response time, we optimize LLaVA's image slicing to selectively focus on areas of utmost interest to users. Additionally, we assess the importance of image patches by combining objective and subjective user attention, adjusting energy usage for transmitting semantic information. This strategy optimizes resource utilization, ensuring precise transmission of critical information. We construct a Visual Question Answering (VQA) dataset for traffic scenarios to evaluate effectiveness. Experimental results show that our semantic communication framework significantly increases accuracy in answering questions under the same channel conditions, performing particularly well in environments with poor Signal-to-Noise Ratios (SNR). Accuracy can be improved by 13.4% at an SNR of 12dB and 33.1% at 10dB, respectively.

摘要

面向任务的语义通信已成为提升各类通信场景性能的基础方法。尽管生成式人工智能(GenAI)的最新进展(如大语言模型LLMs)已被应用于语义通信设计,但大型多模态模型(LMMs)的潜力仍待充分挖掘。本文基于大型语言视觉助手LLaVA构建车辆AI助手,提出一种面向任务的语义通信框架以优化用户与云服务器间的高效交互。为降低计算需求并缩短响应时间,我们优化LLaVA的图像切片机制,使其选择性聚焦用户最关注区域。同时通过结合客观指标与用户主观注意力评估图像块重要性,动态调整语义信息传输的能耗策略,从而优化资源利用并确保关键信息的精准传输。针对交通场景构建视觉问答(VQA)数据集进行效果验证,实验表明:在相同信道条件下,本语义通信框架显著提升问题回答准确率,且在低信噪比(SNR)环境中表现尤为突出——在12dB和10dB信噪比下准确率分别提升13.4%和33.1%。


Incentivizing Inclusive Contributions in Model Sharing Markets

Abstract

arXiv:2505.02462v1 Announce Type: new Abstract: While data plays a crucial role in training contemporary AI models, it is acknowledged that valuable public data will be exhausted in a few years, directing the world's attention towards the massive decentralized private data. However, the privacy-sensitive nature of raw data and lack of incentive mechanism prevent these valuable data from being fully exploited. Addressing these challenges, this paper proposes inclusive and incentivized personalized federated learning (iPFL), which incentivizes data holders with diverse purposes to collaboratively train personalized models without revealing raw data. iPFL constructs a model-sharing market by solving a graph-based training optimization and incorporates an incentive mechanism based on game theory principles. Theoretical analysis shows that iPFL adheres to two key incentive properties: individual rationality and truthfulness. Empirical studies on eleven AI tasks (e.g., large language models' instruction-following tasks) demonstrate that iPFL consistently achieves the highest economic utility, and better or comparable model performance compared to baseline methods. We anticipate that our iPFL can serve as a valuable technique for boosting future AI models on decentralized private data while making everyone satisfied.

摘要

尽管数据在训练当代AI模型中起着关键作用,但公认的是,有价值的公共数据将在几年内耗尽,这使得全球目光转向海量分散的私有数据。然而,原始数据的隐私敏感性及激励机制缺失,阻碍了这些宝贵数据的充分利用。针对这些挑战,本文提出包容性激励型个性化联邦学习(iPFL),该系统在不暴露原始数据的前提下,激励具有多样化目标的数据持有者协同训练个性化模型。iPFL通过求解基于图的训练优化问题构建模型共享市场,并融合基于博弈论原理的激励机制。理论分析表明iPFL符合两项关键激励属性:个体合理性与真实性。在11项AI任务(如大语言模型指令跟随任务)上的实证研究表明,相较于基线方法,iPFL始终能实现最高的经济效用,并获得相当或更优的模型性能。我们预期iPFL能成为未来基于分散私有数据训练AI模型的重要技术,同时实现多方共赢。


El Agente: An Autonomous Agent for Quantum Chemistry

Abstract

arXiv:2505.02484v1 Announce Type: new Abstract: Computational chemistry tools are widely used to study the behaviour of chemical phenomena. Yet, the complexity of these tools can make them inaccessible to non-specialists and challenging even for experts. In this work, we introduce El Agente Q, an LLM-based multi-agent system that dynamically generates and executes quantum chemistry workflows from natural language user prompts. The system is built on a novel cognitive architecture featuring a hierarchical memory framework that enables flexible task decomposition, adaptive tool selection, post-analysis, and autonomous file handling and submission. El Agente Q is benchmarked on six university-level course exercises and two case studies, demonstrating robust problem-solving performance (averaging >87% task success) and adaptive error handling through in situ debugging. It also supports longer-term, multi-step task execution for more complex workflows, while maintaining transparency through detailed action trace logs. Together, these capabilities lay the foundation for increasingly autonomous and accessible quantum chemistry.

摘要

计算化学工具被广泛用于研究化学现象的行为特征。然而,这些工具的复杂性使得非专业人士难以使用,甚至对专家也构成挑战。本研究推出El Agente Q——一个基于大语言模型的多智能体系统,能够根据自然语言用户指令动态生成并执行量子化学工作流程。该系统采用新型认知架构,其层级化记忆框架支持灵活的任务分解、自适应工具选择、后分析处理以及自主文件管理与提交。通过对六项大学课程习题和两个案例研究的基准测试,El Agente Q展现出强大的问题解决能力(平均任务成功率>87%),并能通过原位调试实现自适应错误处理。该系统还支持更复杂工作流程的多步骤长期任务执行,同时通过详细动作追踪日志保持透明度。这些能力共同为日益自主化、平民化的量子化学研究奠定了基础。


Beyond the model: Key differentiators in large language models and multi-agent services

Abstract

arXiv:2505.02489v1 Announce Type: new Abstract: With the launch of foundation models like DeepSeek, Manus AI, and Llama 4, it has become evident that large language models (LLMs) are no longer the sole defining factor in generative AI. As many now operate at comparable levels of capability, the real race is not about having the biggest model but optimizing the surrounding ecosystem, including data quality and management, computational efficiency, latency, and evaluation frameworks. This review article delves into these critical differentiators that ensure modern AI services are efficient and profitable.

摘要

随着DeepSeek、Manus AI和Llama 4等基础模型的发布,大型语言模型(LLMs)已不再是生成式AI的唯一决定性因素。由于当前许多模型已具备相当的能力水平,真正的竞争焦点并非构建最大规模的模型,而是优化包括数据质量与管理、计算效率、延迟及评估框架在内的生态系统。本文综述了这些确保现代AI服务高效性与盈利性的关键差异化要素。


Large Language Model Partitioning for Low-Latency Inference at the Edge

Abstract

arXiv:2505.02533v1 Announce Type: new Abstract: Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence, the length grows and so does the memory and compute load, due to the expanding key-value caches, which store intermediate representations of all previously generated tokens in the multi-head attention (MHA) layer. As this iterative process steadily increases memory and compute demands, layer-based partitioning in resource-constrained edge environments often results in memory overload or high inference latency. To address this and reduce inference latency, we propose a resource-aware Transformer architecture partitioning algorithm, where the partitioning decision is updated at regular intervals during token generation. The approach is myopic in that it is based on instantaneous information about device resource availability and network link bandwidths. When first executed, the algorithm places blocks on devices, and in later executions, it migrates these blocks among devices so that the sum of migration delay and inference delay remains low. Our approach partitions the decoder at the attention head level, co-locating each attention head with its key-value cache and allowing dynamic migrations whenever resources become tight. By allocating different attention heads to different devices, we exploit parallel execution of attention heads and thus achieve substantial reductions in inference delays. Our experiments show that in small-scale settings (3-5 devices), the proposed method achieves within 15 to 20 percent of an exact optimal solver's latency, while in larger-scale tests it achieves notable improvements in inference speed and memory usage compared to state-of-the-art layer-based partitioning approaches.

摘要

基于自回归解码器架构的Transformer大语言模型(LLMs)以离散文本单元(token)为粒度逐次生成文本。随着新生成token不断追加到部分输出序列中,由于多头注意力层(MHA)需要存储所有已生成token的中间表示(键值缓存),序列长度增长导致内存和计算负载持续增加。这种迭代过程会不断推高内存与计算需求,在资源受限的边缘计算环境中,基于层的模型分区方案常引发内存过载或高推理延迟。为降低推理延迟,我们提出一种资源感知的Transformer架构分区算法,该算法在token生成过程中定期更新分区决策。该方法具有短视特性,其决策依据设备实时资源可用性和网络链路带宽的瞬时信息:首次执行时在设备上分配模型块,后续执行时通过跨设备迁移模型块来保持迁移延迟与推理延迟之和最小。我们的方案在注意力头粒度进行解码器分区,将每个注意力头与其键值缓存共同部署,并在资源紧张时触发动态迁移。通过将不同注意力头分配至不同设备,我们实现了注意力头的并行执行,从而显著降低推理延迟。实验表明:在小规模场景(3-5台设备)中,本方法能达到精确最优求解器15%-20%延迟范围内的性能;在大规模测试中,相比最先进的基于层的分区方法,本方案在推理速度和内存使用方面均取得显著提升。
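
The abstract's core observation is that each attention head's key-value cache grows with every generated token, so a placement that fits early in generation may not fit later. The sketch below estimates the per-head cache size and places heads greedily on the device with the most free memory. This greedy rule is a simplification assumed for illustration, not the paper's algorithm, which also accounts for link bandwidth and migration delay; the device names and capacities are hypothetical.

```python
def kv_bytes_per_head(num_tokens, head_dim, bytes_per_value=2):
    """Key-value cache size of one attention head; the factor 2 covers keys and values."""
    return 2 * num_tokens * head_dim * bytes_per_value

def place_heads(num_heads, free_memory, bytes_per_head):
    """Assign each head to the device that currently has the most free memory.

    free_memory maps device name to free bytes. Raises MemoryError when no
    device can hold another head.
    """
    remaining = dict(free_memory)
    placement = {}
    for head in range(num_heads):
        device = max(remaining, key=remaining.get)
        if remaining[device] < bytes_per_head:
            raise MemoryError("no device can hold another attention head")
        placement[head] = device
        remaining[device] -= bytes_per_head
    return placement

KIB = 1024
# Hypothetical edge devices and their free memory.
devices = {"edge0": 1024 * KIB, "edge1": 1024 * KIB, "edge2": 512 * KIB}

# After 1024 tokens with head_dim 64, each head needs 256 KiB and all 8 heads fit.
placement = place_heads(8, devices, kv_bytes_per_head(1024, 64))
```

Re-running the placement as the token count grows corresponds to updating the partitioning decision at regular intervals; with these assumed capacities the same eight heads no longer fit once the sequence reaches 2048 tokens.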


Recursive Decomposition with Dependencies for Generic Divide-and-Conquer Reasoning

Abstract

arXiv:2505.02576v1 Announce Type: new Abstract: Reasoning tasks are crucial in many domains, especially in science and engineering. Although large language models (LLMs) have made progress in reasoning tasks using techniques such as chain-of-thought and least-to-most prompting, these approaches still do not effectively scale to complex problems in either their performance or execution time. Moreover, they often require additional supervision for each new task, such as in-context examples. In this work, we introduce Recursive Decomposition with Dependencies (RDD), a scalable divide-and-conquer method for solving reasoning problems that requires less supervision than prior approaches. Our method can be directly applied to a new problem class even in the absence of any task-specific guidance. Furthermore, RDD supports sub-task dependencies, allowing for ordered execution of sub-tasks, as well as an error recovery mechanism that can correct mistakes made in previous steps. We evaluate our approach on two benchmarks with six difficulty levels each and in two in-context settings: one with task-specific examples and one without. Our results demonstrate that RDD outperforms other methods in a compute-matched setting as task complexity increases, while also being more computationally efficient.

摘要

推理任务在诸多领域尤其是科学与工程中至关重要。尽管大语言模型(LLMs)通过思维链、最少到最多提示等技术在推理任务上取得进展,这些方法在性能或执行时间上仍难以有效扩展到复杂问题。此外,它们通常需要为每个新任务提供额外监督(例如上下文示例)。本研究提出带依赖的递归分解(RDD)——一种可扩展的分治方法,其所需的监督少于现有方案。即使缺乏针对特定任务的指导,该方法也能直接应用于新问题类别。RDD还支持子任务依赖关系,允许有序执行子任务,并具备错误恢复机制以修正先前步骤的错误。我们在两个各含六个难度等级的基准测试中评估该方法,采用两种上下文设置:含任务特定示例与不含示例。结果表明,随着任务复杂性增加,RDD在计算资源匹配的设置中优于其他方法,同时具备更高计算效率。
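
The "sub-task dependencies, allowing for ordered execution" part of RDD can be illustrated with a dependency graph that is executed in topological order. In this sketch the decomposition is written by hand and each sub-task is solved by a plain Python function; in the method described by the abstract an LLM would produce both, recursively. The task names and the dictionary layout are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_subtasks(subtasks):
    """Execute sub-tasks so that every task runs after the tasks it depends on.

    subtasks maps a task name to {"deps": [names], "solve": callable}; the
    callable receives the results of its dependencies in the listed order.
    """
    graph = {name: spec["deps"] for name, spec in subtasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        spec = subtasks[name]
        inputs = [results[dep] for dep in spec["deps"]]
        results[name] = spec["solve"](*inputs)
    return results

# Hand-written decomposition of (2 + 3) * (4 + 1).
subtasks = {
    "left_sum": {"deps": [], "solve": lambda: 2 + 3},
    "right_sum": {"deps": [], "solve": lambda: 4 + 1},
    "product": {"deps": ["left_sum", "right_sum"], "solve": lambda a, b: a * b},
}
results = run_subtasks(subtasks)
```

The two independent sums could run in parallel, while the product must wait for both; an error recovery step, as mentioned in the abstract, would re-solve a failed node and everything downstream of it.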


A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law

Abstract

arXiv:2505.02665v1 Announce Type: new Abstract: This survey explores recent advancements in reasoning large language models (LLMs) designed to mimic "slow thinking", a reasoning process inspired by human cognition, as described in Kahneman's Thinking, Fast and Slow. These models, like OpenAI's o1, focus on scaling computational resources dynamically during complex tasks, such as math reasoning, visual reasoning, medical diagnosis, and multi-agent debates. We present the development of reasoning LLMs and list their key technologies. By synthesizing over 100 studies, this survey charts a path toward LLMs that combine human-like deep thinking with scalable efficiency for reasoning. The review breaks down methods into three categories: (1) test-time scaling, which dynamically adjusts computation based on task complexity via search, sampling, and dynamic verification; (2) reinforced learning, which refines decision-making through iterative improvement leveraging policy networks, reward models, and self-evolution strategies; and (3) slow-thinking frameworks (e.g., long CoT, hierarchical processes), which structure problem-solving into manageable steps. The survey highlights the challenges and further directions of this domain. Understanding and advancing the reasoning abilities of LLMs is crucial for unlocking their full potential in real-world applications, from scientific discovery to decision support systems.

摘要

本综述探讨了旨在模拟'慢思考'(源自卡尼曼《思考,快与慢》中描述的人类认知推理过程)的推理大语言模型(LLMs)的最新进展。这类模型(如OpenAI的o1)通过在数学推理、视觉推理、医疗诊断和多智能体辩论等复杂任务中动态扩展计算资源来实现该目标。我们系统梳理了推理LLMs的发展脉络,并列举其关键技术。通过综合分析100余项研究,本文为兼具类人深度思考能力与可扩展推理效率的LLMs指明了发展路径。现有方法可分为三类:(1) 测试时动态扩展:通过搜索采样、动态验证等方式根据任务复杂度调整计算量;(2) 强化学习:利用策略网络、奖励模型及自我进化策略实现决策迭代优化;(3) 慢思考框架(如长思维链、分层处理):通过可管理的步骤结构化解决问题。研究同时指出了该领域面临的挑战与未来方向。理解并提升LLMs的推理能力,对于释放其在从科学发现到决策支持系统等现实应用中的全部潜能具有关键意义。
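
Test-time scaling through sampling, the first category in the survey, can be illustrated with majority voting over several sampled answers, plus an early-stopping rule so that easy questions consume fewer samples than hard ones. This is one simple instance of the idea rather than a method taken from the survey; the sampler is a stub, and the thresholds are arbitrary assumptions.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the share of samples that agree with it."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

def adaptive_vote(sample_answer, min_samples=3, max_samples=16, threshold=0.7):
    """Draw samples until the leading answer reaches the agreement threshold.

    sample_answer is a zero-argument callable standing in for one model call.
    Returns (answer, agreement share, number of samples used).
    """
    answers = []
    for _ in range(max_samples):
        answers.append(sample_answer())
        if len(answers) >= min_samples:
            _, share = majority_vote(answers)
            if share >= threshold:
                break
    answer, share = majority_vote(answers)
    return answer, share, len(answers)

def make_stub_sampler(outputs):
    """Stub sampler that replays a fixed list of answers (stands in for an LLM)."""
    iterator = iter(outputs)
    return lambda: next(iterator)
```

A question whose samples agree immediately stops at the minimum budget, while a contested one keeps sampling, which is the sense in which computation tracks task complexity.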


Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Abstract

arXiv:2505.02707v1 Announce Type: new Abstract: A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.

摘要

一个能无缝融入日常生活的语音AI代理,将以自主、实时且富有情感表达的方式与人类互动。它不仅能响应指令,更能持续聆听、推理并主动回应,促成流畅、动态且情感共鸣的交互。我们推出Voila系列大规模语音-语言基础模型,向这一愿景迈出重要一步。Voila突破传统流水线系统,采用新型端到端架构,在保留音调、节奏和情感等丰富声音细节的同时,实现全双工低延迟对话,响应延迟仅195毫秒,超越人类平均反应时间。其分层多尺度Transformer架构融合了大语言模型(LLMs)的推理能力与强大声学建模技术,支持自然且具备角色意识的语音生成——用户仅需通过文本指令即可定义说话者身份、语调等特征。此外,Voila支持超百万种预制声音,并能基于短至10秒的音频样本高效定制新声音。除口语对话外,Voila被设计为统一的多功能模型,适用于自动语音识别(ASR)、文本转语音(TTS)等广泛语音应用,经简单适配还可实现多语种语音翻译。Voila已全面开源以支持开放研究,加速下一代人机交互的发展。


Technical Report: Evaluating Goal Drift in Language Model Agents

Abstract

arXiv:2505.02709v1 Announce Type: new Abstract: As language models (LMs) are increasingly deployed as autonomous agents, their robust adherence to human-assigned objectives becomes crucial for safe operation. When these agents operate independently for extended periods without human oversight, even initially well-specified goals may gradually shift. Detecting and measuring goal drift - an agent's tendency to deviate from its original objective over time - presents significant challenges, as goals can shift gradually, causing only subtle behavioral changes. This paper proposes a novel approach to analyzing goal drift in LM agents. In our experiments, agents are first explicitly given a goal through their system prompt, then exposed to competing objectives through environmental pressures. We demonstrate that while the best-performing agent (a scaffolded version of Claude 3.5 Sonnet) maintains nearly perfect goal adherence for more than 100,000 tokens in our most difficult evaluation setting, all evaluated models exhibit some degree of goal drift. We also find that goal drift correlates with models' increasing susceptibility to pattern-matching behaviors as the context length grows.

摘要

随着语言模型(LMs)越来越多地被部署为自主智能体,其对人类设定目标的稳健遵循对安全运行至关重要。当这些智能体在无人监督的情况下长期独立运行时,即使最初明确指定的目标也可能逐渐发生偏移。检测和衡量目标漂移(即智能体随时间推移偏离原始目标的倾向)存在重大挑战,因为目标可能逐渐变化,仅导致细微的行为改变。本文提出了一种分析语言模型智能体目标漂移的新方法。实验中,我们首先通过系统提示明确赋予智能体目标,随后通过环境压力使其暴露于竞争性目标。研究表明,在最严苛的评估设置下,性能最佳的智能体(基于Claude 3.5 Sonnet的支架版本)能在超过10万标记的范围内近乎完美地保持目标遵循,但所有被评估模型均表现出不同程度的目标漂移。我们还发现,随着上下文长度增加,目标漂移与模型对模式匹配行为的敏感性增强存在相关性。


Enhancing LLMs' Clinical Reasoning with Real-World Data from a Nationwide Sepsis Registry

Abstract

arXiv:2505.02722v1 Announce Type: new Abstract: Although large language models (LLMs) have demonstrated impressive reasoning capabilities across general domains, their effectiveness in real-world clinical practice remains limited. This is likely due to their insufficient exposure to real-world clinical data during training, as such data is typically not included due to privacy concerns. To address this, we propose enhancing the clinical reasoning capabilities of LLMs by leveraging real-world clinical data. We constructed reasoning-intensive questions from a nationwide sepsis registry and fine-tuned Phi-4 on these questions using reinforcement learning, resulting in C-Reason. C-Reason exhibited strong clinical reasoning capabilities on the in-domain test set, as evidenced by both quantitative metrics and expert evaluations. Furthermore, its enhanced reasoning capabilities generalized to a sepsis dataset involving different tasks and patient cohorts, an open-ended consultation task on antibiotic use, and other diseases. Future research should focus on training LLMs with large-scale, multi-disease clinical datasets to develop more powerful, general-purpose clinical reasoning models.

摘要

尽管大语言模型(LLMs)在通用领域已展现出卓越的推理能力,但其在真实世界临床实践中的有效性仍显不足。这可能是由于训练过程中接触的真实临床数据有限——此类数据通常因隐私问题未被纳入。为解决该问题,我们提出通过利用真实临床数据来增强LLMs的临床推理能力。我们从全国性脓毒症注册库构建了推理密集型问题集,并采用强化学习对Phi-4模型进行微调,最终开发出C-Reason系统。定量指标与专家评估均证实,C-Reason在领域内测试集上表现出强大的临床推理能力。此外,其增强的推理能力可泛化至不同任务和患者群体的脓毒症数据集、抗生素使用开放式咨询任务以及其他疾病领域。未来研究应聚焦于利用大规模多疾病临床数据集训练LLMs,以开发更强大的通用临床推理模型。


FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

Abstract

arXiv:2505.02735v1 Announce Type: new Abstract: Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only a 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in formal reasoning settings. We believe that FormalMATH provides a robust benchmark for evaluating formal mathematical reasoning.

摘要

形式化数学推理仍是人工智能面临的关键挑战,现有基准在广度和规模上的局限阻碍了相关进展。为此,我们提出FormalMATH——一个基于Lean4的大规模基准测试集,包含5,560个经过形式化验证的问题,涵盖从高中数学奥林匹克竞赛到本科阶段跨多领域(如代数、应用数学、微积分、数论和离散数学)的定理。为降低人工形式化的低效性,我们开发了一种新型人机协同自动形式化流程,整合了:(1)专用于命题自动形式化的大语言模型(LLMs);(2)多LLM语义验证机制;(3)基于否证的反例过滤策略(利用现成LLM证明器)。该方法在保持原始自然语言问题保真度的前提下,通过人工验证前保留72.09%的命题,显著降低了专家标注成本。对前沿LLM定理证明器的评估揭示了重大局限:即使在实用采样预算下,最强模型的成功率仅达16.46%,并表现出明显领域偏差(如擅长代数但拙于微积分)及对简化自动化策略的过度依赖。值得注意的是,我们发现思维链推理场景中存在反直觉现象:自然语言解题指导与证明成功率呈负相关,表明人类撰写的非形式化推理在形式化推理环境中反而引入了噪声而非清晰性。我们相信FormalMATH能为形式化数学推理研究提供强有力的基准支撑。
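
The negation-based disproof filter described above can be illustrated with a tiny Lean 4 sketch. The statement below is a hypothetical, deliberately wrong formalization (it is not taken from FormalMATH): if a prover can prove the negation of a candidate statement, the candidate is discarded before it reaches human reviewers.

```lean
-- Hypothetical faulty autoformalization: "every natural number is less than 5".
-- Proving its negation shows the candidate statement is false, so a
-- negation-based filter would drop it.
theorem candidate_is_false : ¬ (∀ n : Nat, n < 5) := by
  intro h
  -- Instantiating at n = 5 yields 5 < 5, which contradicts irreflexivity of <.
  exact Nat.lt_irrefl 5 (h 5)
```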


Giving Simulated Cells a Voice: Evolving Prompt-to-Intervention Models for Cellular Control

Abstract

arXiv:2505.02766v1 Announce Type: new Abstract: Guiding biological systems toward desired states, such as morphogenetic outcomes, remains a fundamental challenge with far-reaching implications for medicine and synthetic biology. While large language models (LLMs) have enabled natural language as an interface for interpretable control in AI systems, their use as mediators for steering biological or cellular dynamics remains largely unexplored. In this work, we present a functional pipeline that translates natural language prompts into spatial vector fields capable of directing simulated cellular collectives. Our approach combines a large language model with an evolvable neural controller (Prompt-to-Intervention, or P2I), optimized via evolutionary strategies to generate behaviors such as clustering or scattering in a simulated 2D environment. We demonstrate that even with constrained vocabulary and simplified cell models, evolved P2I networks can successfully align cellular dynamics with user-defined goals expressed in plain language. This work offers a complete loop from language input to simulated bioelectric-like intervention to behavioral output, providing a foundation for future systems capable of natural language-driven cellular control.

摘要

引导生物系统实现预期状态(如形态发生结果)仍是基础性挑战,对医学和合成生物学具有深远意义。尽管大语言模型(LLMs)已使自然语言成为AI系统中可解释控制的接口,但其作为调控生物或细胞动力学中介的应用仍待探索。本研究提出一个功能性流程,将自然语言提示转化为能指导模拟细胞群体的空间矢量场。该方法将大语言模型与可进化神经控制器(Prompt-to-Intervention,简称P2I)相结合,通过进化策略优化以在模拟2D环境中生成聚集或分散等行为。实验表明,即使使用受限词汇和简化细胞模型,进化后的P2I网络仍能成功使细胞动力学与用户用自然语言定义的目标保持一致。该研究实现了从语言输入到模拟类生物电干预再到行为输出的完整闭环,为未来实现自然语言驱动的细胞控制系统奠定了基础。


Knowing You Don't Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing

Abstract

arXiv:2505.02811v1 Announce Type: new Abstract: Retrieval Augmented Generation (RAG) has shown strong capability in enhancing language models' knowledge and reducing AI generative hallucinations, driving its widespread use. However, complex tasks requiring multi-round retrieval remain challenging, and early attempts tend to be overly optimistic without a good sense of self-skepticism. Current multi-round RAG systems may continue searching even when enough information has already been retrieved, or they may provide incorrect answers without having sufficient information or knowledge. Existing solutions either require large amounts of expensive human-labeled process supervision data or lead to subpar performance. This paper aims to address these limitations by introducing a new framework, SIM-RAG, to explicitly enhance RAG systems' self-awareness and multi-round retrieval capabilities. To train SIM-RAG, we first let a RAG system self-practice multi-round retrieval, augmenting existing question-answer pairs with intermediate inner monologue reasoning steps to generate synthetic training data. For each pair, the system may explore multiple retrieval paths, which are labeled as successful if they reach the correct answer and unsuccessful otherwise. Using this data, we train a lightweight information sufficiency Critic. At inference time, the Critic evaluates whether the RAG system has retrieved sufficient information at each round, guiding retrieval decisions and improving system-level self-awareness through in-context reinforcement learning. Experiments across multiple prominent RAG benchmarks show that SIM-RAG is an effective multi-round RAG solution. Furthermore, this framework is system-efficient, adding a lightweight component to RAG without requiring modifications to existing LLMs or search engines, and data-efficient, eliminating the need for costly human-annotated mid-step retrieval process supervision data.

摘要

检索增强生成(RAG)技术在提升语言模型知识储备、减少AI生成幻觉方面展现出强大能力,因而获得广泛应用。然而,需要多轮检索的复杂任务仍具挑战性,早期尝试往往因缺乏自我质疑意识而过于乐观。当前多轮RAG系统可能在已获取足够信息时仍持续搜索,或在信息不足时提供错误答案。现有解决方案要么需要大量昂贵的人工标注流程监督数据,要么导致性能欠佳。

本文提出新框架SIM-RAG,旨在通过显式增强RAG系统的自我认知和多轮检索能力来解决这些局限。为训练SIM-RAG,我们首先让RAG系统自主进行多轮检索实践,通过添加中间内心独白式推理步骤来扩展现有问答对,从而生成合成训练数据。对于每对问答,系统可能探索多条检索路径——成功抵达正确答案的路径被标记为成功,反之为失败。利用这些数据,我们训练了一个轻量级信息充分性评判器(Critic)。在推理阶段,该评判器通过上下文强化学习评估RAG系统每轮是否已检索到充分信息,从而指导检索决策并提升系统级自我认知。

在多个知名RAG基准测试上的实验表明,SIM-RAG是一种有效的多轮RAG解决方案。该框架具有系统高效性——仅需为RAG添加轻量级组件而无需修改现有大语言模型或搜索引擎,同时具备数据高效性——无需昂贵的人工标注中间步骤检索流程监督数据。
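
The inference-time loop described above (retrieve, propose an answer, let a Critic decide whether to stop) can be sketched as follows. This is a minimal illustration, not the SIM-RAG implementation: `multi_round_rag`, the toy retriever, generator, and rule-based critic are all hypothetical stand-ins, whereas the paper's Critic is a trained lightweight model.

```python
from typing import Callable, List, Tuple

Retriever = Callable[[str], List[str]]
Generator = Callable[[str, List[str]], Tuple[str, str]]  # -> (answer, next query)
Critic = Callable[[str, List[str], str], bool]           # -> "information is sufficient"


def multi_round_rag(
    question: str,
    retrieve: Retriever,
    generate: Generator,
    critic: Critic,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Keep retrieving until the critic accepts the answer or the budget runs out."""
    context: List[str] = []
    query = question
    answer = ""
    for round_idx in range(1, max_rounds + 1):
        context.extend(retrieve(query))
        # The generator proposes an answer and a follow-up query for the next round.
        answer, query = generate(question, context)
        if critic(question, context, answer):
            return answer, round_idx
    # Budget exhausted: return the last proposal (a real system might abstain instead).
    return answer, max_rounds


# Toy components (assumptions for illustration only).
TOY_INDEX = {"capital of France": ["Paris is the capital of France."]}


def toy_retrieve(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])


def toy_generate(question: str, context: List[str]) -> Tuple[str, str]:
    if any("Paris" in doc for doc in context):
        return "Paris", question
    return "unknown", "capital of France"


def toy_critic(question: str, context: List[str], answer: str) -> bool:
    return answer != "unknown"


result = multi_round_rag(
    "What is the capital of France?", toy_retrieve, toy_generate, toy_critic
)
```

In this toy run the first retrieval finds nothing, the critic rejects the "unknown" answer, and the second round succeeds, so the loop stops after two rounds.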


AutoLibra: Agent Metric Induction from Open-Ended Feedback

Abstract

arXiv:2505.02820v1 Announce Type: new Abstract: Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation, that transforms open-ended human feedback, e.g., "If you find that the button is disabled, don't click it again", or "This agent has too much autonomy to decide what to do on its own", into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate on a wide range of text game tasks, improving agent performance over baseline by a mean of 20%. Second, we show that AutoLibra can iteratively select high-quality fine-tuning data for web navigation agents. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.

摘要

智能体主要通过任务成功率指标进行评估和优化,这类指标存在粒度粗糙、依赖专家人工设计且无法奖励中间涌现行为的问题。我们提出AutoLibra评估框架,能将开放式人类反馈(如"发现按钮禁用时不应重复点击"或"该智能体自主决策权过高")转化为细粒度行为评估指标。该框架通过将反馈锚定至智能体行为、聚类相似正负行为,并创建具有明确定义和具体实例的评估指标(可用于提示LLM-as-a-Judge评估器)来实现这一目标。我们进一步提出两个元指标来评估(诱导)指标集与开放反馈的匹配度:"覆盖率"和"冗余度"。通过优化这些元指标,实验证明AutoLibra能比现有评估基准产生更具体的智能体评估指标,并发现新的分析维度。我们还展示了AutoLibra在智能体改进中的两项应用:首先,在多种文本游戏任务中,AutoLibra诱导的指标作为提示工程目标优于任务成功率,使智能体性能平均提升20%;其次,该框架能迭代筛选网页导航智能体的高质量微调数据。结果表明AutoLibra是评估和改进语言智能体的强大任务无关工具。
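
The two meta-metrics can be made concrete with a small sketch. The abstract does not give formal definitions, so the ones below are simplifying assumptions: coverage is the share of feedback items matched by at least one induced metric, and redundancy is the share of induced metrics that match no feedback item. The function names and example data are hypothetical.

```python
from typing import Dict, Set


def coverage(feedback_to_metrics: Dict[str, Set[str]]) -> float:
    """Share of feedback items matched by at least one induced metric (assumed definition)."""
    if not feedback_to_metrics:
        return 0.0
    covered = sum(1 for matched in feedback_to_metrics.values() if matched)
    return covered / len(feedback_to_metrics)


def redundancy(metrics: Set[str], feedback_to_metrics: Dict[str, Set[str]]) -> float:
    """Share of induced metrics that match no feedback item (assumed definition)."""
    if not metrics:
        return 0.0
    used: Set[str] = set().union(*feedback_to_metrics.values())
    return len(metrics - used) / len(metrics)


# Hypothetical matching of three feedback items against three induced metrics.
matches = {
    "feedback_1": {"avoid_repeated_clicks"},
    "feedback_2": set(),
    "feedback_3": {"avoid_repeated_clicks", "asks_before_acting"},
}
induced = {"avoid_repeated_clicks", "asks_before_acting", "uses_search"}

cov = coverage(matches)             # 2 of 3 feedback items are covered
red = redundancy(induced, matches)  # 1 of 3 metrics is never used
```

Under these assumed definitions, a good metric set pushes coverage up while keeping redundancy low.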


Building Scalable AI-Powered Applications with Cloud Databases: Architectures, Best Practices and Performance Considerations

Abstract

arXiv:2504.18793v1 Announce Type: cross Abstract: The rapid adoption of AI-powered applications demands high-performance, scalable, and efficient cloud database solutions, as traditional architectures often struggle with AI-driven workloads requiring real-time data access, vector search, and low-latency queries. This paper explores how cloud-native databases enable AI-driven applications by leveraging purpose-built technologies such as vector databases (pgvector), graph databases (AWS Neptune), NoSQL stores (Amazon DocumentDB, DynamoDB), and relational cloud databases (Aurora MySQL and PostgreSQL). It presents architectural patterns for integrating AI workloads with cloud databases, including Retrieval-Augmented Generation (RAG) [1] with LLMs, real-time data pipelines, AI-driven query optimization, and embeddings-based search. Performance benchmarks, scalability considerations, and cost-efficient strategies are evaluated to guide the design of AI-enabled applications. Real-world case studies from industries such as healthcare, finance, and customer experience illustrate how enterprises utilize cloud databases to enhance AI capabilities while ensuring security, governance, and compliance with enterprise and regulatory standards. By providing a comprehensive analysis of AI and cloud database integration, this paper serves as a practical guide for researchers, architects, and enterprises to build next-generation AI applications that optimize performance, scalability, and cost efficiency in cloud environments.

摘要

人工智能应用的快速普及对高性能、可扩展且高效的云数据库解决方案提出了迫切需求,传统架构往往难以应对需要实时数据访问、向量搜索和低延迟查询的AI驱动型工作负载。本文探讨云原生数据库如何通过专用技术栈(包括向量数据库pgvector、图数据库AWS Neptune、NoSQL存储Amazon DocumentDB/DynamoDB以及关系型云数据库Aurora MySQL/PostgreSQL)赋能AI驱动型应用,提出AI工作负载与云数据库集成的架构模式,涵盖与大语言模型结合的检索增强生成技术(RAG)[1]、实时数据管道、AI驱动的查询优化及基于嵌入向量的搜索。通过性能基准测试、可扩展性评估和成本优化策略分析,为AI应用设计提供指导。来自医疗、金融和客户体验等行业的实际案例表明,企业如何利用云数据库在确保安全性、治理能力及符合企业/监管标准的前提下提升AI能力。本文通过对AI与云数据库融合的全面分析,为研究人员、架构师和企业构建新一代AI应用提供实践指南,助力实现云环境中性能、可扩展性与成本效益的最优平衡。
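
As a rough illustration of the embeddings-based search step used in RAG pipelines, the sketch below ranks stored vectors by cosine similarity to a query. It is a brute-force NumPy stand-in and does not use the pgvector API; a vector database would typically serve the same query through an index. The function name and toy vectors are assumptions.

```python
from typing import List, Tuple

import numpy as np


def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 2) -> List[Tuple[int, float]]:
    """Return (row index, cosine similarity) for the k rows of `corpus` closest to `query`."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-dimensional "embeddings" (illustrative values only).
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.1])
hits = top_k_cosine(query, corpus, k=2)
```

Here the query points almost along the first axis, so row 0 ranks first and the diagonal row 2 second.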


Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

Abstract

arXiv:2505.01456v1 Announce Type: cross Abstract: LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs as they integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates varying-proximity samples for testing generalization and specificity, followed by manual filtering for maintaining high quality. We then evaluate six defense objectives against seven attacks (four whitebox, three blackbox), including a novel whitebox method leveraging interpretability of hidden states. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing unlearning in MLLMs.

摘要

基于海量数据训练的LLM可能无意中习得敏感信息(如个人详情和潜在有害内容)。多模态LLM由于整合了图像与文本等多模态信息,这一风险进一步加剧。攻击者可通过多模态提示利用此类知识提取敏感细节。评估MLLM针对性遗忘此类信息(定向反学习)的效果,需要创建高质量、标注完善的图文对。尽管现有反学习研究集中于文本领域,多模态反学习仍待探索。为此,我们首先提出多模态反学习基准UnLOK-VQA(反学习外部知识视觉问答),以及用于评估从MLLM删除特定多模态知识方法的攻防框架。我们采用自动化流程扩展视觉问答数据集,生成不同近似度的样本来测试泛化性与特异性,并通过人工过滤保持高质量。随后针对七种攻击方式(四种白盒、三种黑盒,包括利用隐藏状态可解释性的新型白盒方法)评估六种防御目标。结果表明:多模态攻击效果优于纯文本或图像攻击;最有效防御方案是从模型内部状态移除答案信息。此外,更大模型展现出更强的编辑后鲁棒性,表明模型规模可提升安全性。UnLOK-VQA为推进MLLM反学习研究提供了严谨基准。


MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling

Abstract

arXiv:2505.01459v1 Announce Type: cross Abstract: This paper introduces MoxE, a novel architecture that synergistically combines the Extended Long Short-Term Memory (xLSTM) with the Mixture of Experts (MoE) framework to address critical scalability and efficiency challenges in large language models (LLMs). The proposed method effectively leverages xLSTM's innovative memory structures while strategically introducing sparsity through MoE to substantially reduce computational overhead. At the heart of our approach is a novel entropy-based routing mechanism, designed to dynamically route tokens to specialized experts, thereby ensuring efficient and balanced resource utilization. This entropy awareness enables the architecture to effectively manage both rare and common tokens, with mLSTM blocks being favored to handle rare tokens. To further enhance generalization, we introduce a suite of auxiliary losses, including entropy-based and group-wise balancing losses, ensuring robust performance and efficient training. Theoretical analysis and empirical evaluations rigorously demonstrate that MoxE achieves significant efficiency gains and enhanced effectiveness compared to existing approaches, marking a notable advancement in scalable LLM architectures.

摘要

本文提出MoxE——一种将扩展长短期记忆网络(xLSTM)与专家混合(MoE)框架协同整合的新型架构,旨在解决大语言模型(LLM)的可扩展性与效率关键挑战。该方法在有效利用xLSTM创新记忆结构的同时,通过MoE策略性引入稀疏性以显著降低计算开销。其核心是设计了一种基于熵的动态路由机制,可将标记智能分配至专用专家模块,从而确保资源的高效均衡利用。这种熵感知能力使架构能同时优化处理稀有与常见标记,其中mLSTM模块被优先用于处理稀有标记。为进一步增强泛化能力,我们引入包含基于熵的损失函数和分组平衡损失在内的辅助损失组合,以保障模型鲁棒性与训练效率。理论分析与实证评估充分表明,相比现有方法,MoxE在实现显著效率提升的同时具有更优的效能,标志着可扩展LLM架构的重要进展。
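
The abstract does not spell out the routing rule, so the following is only one hypothetical way an entropy signal could steer tokens toward mLSTM experts: tokens whose router distribution has high entropy (used here as an assumed proxy for rare or uncertain tokens) receive an extra bias toward experts flagged as mLSTM. All names, the bias form, and the top-1 selection are assumptions, not the MoxE mechanism.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def entropy_aware_route(logits: np.ndarray, is_mlstm: np.ndarray, bias_scale: float = 1.0):
    """Top-1 routing with an entropy-scaled bias toward mLSTM experts (hypothetical rule).

    logits:   (tokens, experts) router scores
    is_mlstm: (experts,) array of 0/1 flags marking mLSTM experts
    """
    probs = softmax(logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    norm_entropy = entropy / np.log(logits.shape[-1])  # scaled to roughly [0, 1]
    biased = logits + bias_scale * norm_entropy[:, None] * is_mlstm[None, :]
    return biased.argmax(axis=-1), norm_entropy


logits = np.array([
    [4.0, 0.0, 0.0, 0.0],   # confident token: low entropy
    [0.1, 0.0, 0.0, 0.05],  # uncertain token: near-uniform router scores
])
is_mlstm = np.array([0.0, 0.0, 1.0, 1.0])
experts, token_entropy = entropy_aware_route(logits, is_mlstm)
```

In this toy example the confident token stays with expert 0, while the uncertain token is pushed to an mLSTM expert.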


BiGSCoder: State Space Model for Code Understanding

Abstract

arXiv:2505.01475v1 Announce Type: cross Abstract: We present BiGSCoder, a novel encoder-only bidirectional state-space model (SSM) featuring a gated architecture, pre-trained for code understanding on a code dataset using masked language modeling. Our work aims to systematically evaluate SSMs' capabilities in coding tasks compared to traditional transformer architectures; BiGSCoder is built for this purpose. Through comprehensive experiments across diverse pre-training configurations and code understanding benchmarks, we demonstrate that BiGSCoder outperforms transformer-based models, despite utilizing simpler pre-training strategies and much less training data. Our results indicate that BiGSCoder can serve as a more sample-efficient alternative to conventional transformer models. Furthermore, our study shows that SSMs perform better without positional embeddings and can effectively extrapolate to longer sequences during fine-tuning.

摘要

我们提出BiGSCoder——一种新型仅编码器的双向状态空间模型(SSM),其采用门控架构,通过掩码语言建模在代码数据集上进行预训练以支持代码理解。本研究旨在系统评估SSMs在编码任务中相对于传统Transformer架构的性能优势,为此专门构建了BiGSCoder。通过在不同预训练配置和代码理解基准测试中的全面实验,我们证明尽管采用更简单的预训练策略和少得多的训练数据,BiGSCoder仍能超越基于Transformer的模型。结果表明,BiGSCoder可作为传统Transformer模型更具样本效率的替代方案。此外,研究发现SSMs在没有位置嵌入时表现更优,且能在微调阶段有效外推至更长序列。


Subset Selection for Fine-Tuning: A Utility-Diversity Balanced Approach for Mathematical Domain Adaptation

Abstract

arXiv:2505.01523v1 Announce Type: cross Abstract: We propose a refined approach to efficiently fine-tune large language models (LLMs) on specific domains like the mathematical domain by employing a budgeted subset selection method. Our approach combines utility and diversity metrics to select the most informative and representative training examples. The final goal is to achieve near-full-dataset performance with meticulously selected data points while significantly reducing computational cost and training time. The utility metric incorporates both perplexity and Chain-of-Thought (CoT) loss to identify challenging examples that contribute most to model learning, while the diversity metric ensures broad coverage across mathematical subdomains. We evaluate our method on LLaMA-3 8B and Phi-3 models, comparing against several baseline approaches, including random selection, diversity-based sampling, and existing state-of-the-art subset selection techniques.

摘要

我们提出一种改进方法,通过采用预算约束的子集选择策略,在数学等特定领域高效微调大语言模型(LLMs)。该方法结合效用性与多样性指标,筛选最具信息量和代表性的训练样本。最终目标是通过从全量数据集中精选数据点,在显著降低计算成本和训练时间的同时,实现接近全数据集性能的竞争性表现。效用性指标综合了困惑度和思维链(CoT)损失,以识别对模型学习贡献最大的挑战性样本;而多样性指标则确保覆盖数学各子领域的广泛性。我们在LLaMA-3 8B和Phi-3模型上评估该方法,并与随机选择、基于多样性的采样及现有先进子集选择技术等基线方案进行对比。
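
A budgeted selection that trades utility against diversity can be sketched with a simple greedy rule. This is an assumed scoring scheme, not the paper's: each example has a precomputed utility (in the paper this would combine perplexity and CoT loss), and the diversity bonus shrinks as more examples from the same subdomain are picked. The function name and toy data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def select_subset(
    utility: Sequence[float],
    domains: Sequence[str],
    budget: int,
    diversity_weight: float = 1.0,
) -> List[int]:
    """Greedily pick `budget` indices, scoring utility plus a per-domain diversity bonus."""
    chosen: List[int] = []
    counts: Counter = Counter()
    remaining = set(range(len(utility)))
    while remaining and len(chosen) < budget:
        # The bonus is diversity_weight / (1 + number already chosen from that domain).
        best = max(
            sorted(remaining),
            key=lambda i: utility[i] + diversity_weight / (1 + counts[domains[i]]),
        )
        chosen.append(best)
        counts[domains[best]] += 1
        remaining.remove(best)
    return chosen


# Toy data: three high-utility algebra problems and one geometry problem.
utility = [0.9, 0.8, 0.7, 0.4]
domains = ["algebra", "algebra", "algebra", "geometry"]

balanced = select_subset(utility, domains, budget=2, diversity_weight=1.0)
utility_only = select_subset(utility, domains, budget=2, diversity_weight=0.0)
```

With the diversity bonus the second pick is the geometry problem; without it, the selection is the two highest-utility algebra problems.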


Emotions in the Loop: A Survey of Affective Computing for Emotional Support

Abstract

arXiv:2505.01542v1 Announce Type: cross Abstract: In a world where technology is increasingly embedded in our everyday experiences, systems that sense and respond to human emotions are elevating digital interaction. At the intersection of artificial intelligence and human-computer interaction, affective computing is emerging with innovative solutions where machines are humanized by enabling them to process and respond to user emotions. This survey paper explores recent research contributions in affective computing applications in the area of emotion recognition, sentiment analysis and personality assignment developed using approaches like large language models (LLMs), multimodal techniques, and personalized AI systems. We analyze the key contributions and innovative methodologies applied by the selected research papers by categorizing them into four domains: AI chatbot applications, multimodal input systems, mental health and therapy applications, and affective computing for safety applications. We then highlight the technological strengths as well as the research gaps and challenges related to these studies. Furthermore, the paper examines the datasets used in each study, highlighting how modality, scale, and diversity impact the development and performance of affective models. Finally, the survey outlines ethical considerations and proposes future directions to develop applications that are more safe, empathetic and practical.

摘要

在技术日益融入日常体验的世界中,能够感知并响应人类情感的系统正在提升数字交互体验。作为人工智能与人机交互的交叉领域,情感计算通过使机器具备处理和响应用户情绪的能力,正以创新解决方案推动机器的人性化发展。本综述论文系统探究了情感计算在情绪识别、情感分析和性格推断等应用领域的最新研究成果,这些研究主要采用大语言模型(LLMs)、多模态技术和个性化AI系统等方法。我们通过将选定研究论文归类至四大应用领域——AI聊天机器人应用、多模态输入系统、心理健康治疗应用以及安全领域的情感计算,深入分析了其核心贡献与创新方法论。研究同时揭示了相关技术的优势以及存在的科研缺口与挑战。此外,本文详细考察了各研究采用的数据集,阐明了数据模态、规模及多样性对情感模型开发与性能的影响。最后,综述提出了伦理考量,并规划了未来发展方向,以推动构建更安全、更具同理心且实用的情感计算应用。


PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents

Abstract

arXiv:2505.01592v1 Announce Type: cross Abstract: The growing capabilities of large language models (LLMs) in instruction-following and context-understanding lead to the era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process and not only the end result. To address this gap, we propose PIPA, a unified evaluation protocol that conceptualizes the behavioral process of interactive task planning agents within a partially observable Markov Decision Process (POMDP) paradigm. The proposed protocol offers a comprehensive assessment of agent performance through a set of atomic evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within the agent's decision-making pipeline. Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.

摘要

大型语言模型(LLMs)在指令遵循和上下文理解方面日益增强的能力,推动着智能代理时代的到来,并催生了众多应用场景。其中,任务规划代理在现实场景中尤为突出,这些场景通常涉及复杂的内部流程,如上下文理解、工具管理和响应生成。然而,现有基准测试主要基于任务完成度作为整体效能的代理指标进行评估。我们提出假设:仅提高任务完成度与最大化用户满意度并不一致,因为用户是与整个代理流程互动,而非仅关注最终结果。为填补这一空白,我们提出PIPA评估框架——该协议将交互式任务规划代理的行为过程概念化为部分可观测马尔可夫决策过程(POMDP)范式。通过一组原子化评估标准,该框架可对代理性能进行全面评估,使研究者和实践者能够诊断代理决策流程中的具体优势与缺陷。分析表明,不同代理在行为阶段各有所长,而用户满意度同时受结果和中间行为影响。我们还展望了未来方向,包括利用多代理系统的解决方案,并指出任务规划中用户模拟器的局限性。


Always Tell Me The Odds: Fine-grained Conditional Probability Estimation

Abstract

arXiv:2505.01595v1 Announce Type: cross Abstract: We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context. Recent advances in large language models (LLMs) have significantly enhanced their reasoning capabilities, particularly on well-defined tasks with complete information. However, LLMs continue to struggle with making accurate and well-calibrated probabilistic predictions under uncertainty or partial information. While incorporating uncertainty into model predictions often boosts performance, obtaining reliable estimates of that uncertainty remains understudied. In particular, LLM probability estimates tend to be coarse and biased towards more frequent numbers. Through a combination of human and synthetic data creation and assessment, scaling to larger models, and better supervision, we propose a set of strong and precise probability estimation models. We conduct systematic evaluations across tasks that rely on conditional probability estimation and show that our approach consistently outperforms existing fine-tuned and prompting-based methods by a large margin.

摘要

我们提出了一种最先进的细粒度概率估计模型,用于在给定上下文条件下对命题进行概率评估。尽管大语言模型(LLMs)在推理能力方面取得了显著进展,特别是在信息完整的明确定义任务上表现优异,但其在不确定或部分信息条件下进行准确且校准良好的概率预测仍存在困难。虽然将不确定性纳入模型预测通常能提升性能,但如何获得可靠的不确定性估计仍未得到充分研究。具体而言,LLM的概率估计往往较为粗糙,且倾向于更常见的数值。通过结合人工与合成数据创建与评估、扩展至更大规模模型以及改进监督方法,我们提出了一组强大而精确的概率估计模型。我们在依赖条件概率估计的各项任务中进行了系统评估,结果表明:相较于现有基于微调和提示的方法,我们的方法始终以显著优势优于它们。
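
To see why coarse probability estimates are costly, one standard measure is the Brier score (mean squared error between predicted probabilities and binary outcomes; lower is better). The abstract does not state which metrics the paper uses, so this is an illustrative choice, and the example numbers are made up.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


outcomes = [1, 0]
coarse = brier_score([0.5, 0.5], outcomes)  # always answering "50%"
fine = brier_score([0.9, 0.1], outcomes)    # sharper estimates that match the outcomes
```

The estimator that always answers 0.5 scores 0.25 here, while the sharper one scores about 0.01.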


Don't be lazy: CompleteP enables compute-efficient deep transformers

Abstract

arXiv:2505.01618v1 Announce Type: cross Abstract: We study compute efficiency of LLM training when using different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as learning rate) across changes in model depth, requiring practitioners to either re-tune these HPs as they scale up (expensive), or accept sub-optimal training when re-tuning is prohibitive. Even when they achieve HP transfer, we develop theory to show parameterizations may still exist in the lazy learning regime where layers learn only features close to their linearization, preventing effective use of depth and nonlinearity. Finally, we identify and adopt the unique parameterization we call CompleteP that achieves both depth-wise HP transfer and non-lazy learning in all layers. CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts. Moreover, CompleteP enables 12-34% compute efficiency improvements over the prior state-of-the-art.

摘要

我们研究了使用不同参数化方法(即随模型规模变化调整模型和优化器超参数的规则)时大语言模型训练的计算效率。某些参数化方法无法在模型深度变化时传递最优基础超参数(如学习率),迫使实践者要么在扩大规模时重新调整这些超参数(成本高昂),要么在无法重新调整时接受次优训练。即使实现了超参数传递,我们通过理论分析发现参数化方法仍可能处于惰性学习状态——各层仅学习接近其线性化的特征,从而无法有效利用深度和非线性。最终,我们确定并采用了一种称为CompleteP的独特参数化方法,该方法在所有网络层中同时实现了深度维度的超参数传递和非惰性学习。CompleteP使更广泛的模型宽度/深度比例保持计算高效,解锁了更适合不同硬件设置和操作环境的模型架构。此外,与现有最优方法相比,CompleteP实现了12-34%的计算效率提升。


A Domain Adaptation of Large Language Models for Classifying Mechanical Assembly Components

Abstract

arXiv:2505.01627v1 Announce Type: cross Abstract: The conceptual design phase represents a critical early stage in the product development process, where designers generate potential solutions that meet predefined design specifications based on functional requirements. Functional modeling, a foundational aspect of this phase, enables designers to reason about product functions before specific structural details are determined. A widely adopted approach to functional modeling is the Function-Behavior-Structure (FBS) framework, which supports the transformation of functional intent into behavioral and structural descriptions. However, the effectiveness of function-based design is often hindered by the lack of well-structured and comprehensive functional data. This scarcity can negatively impact early design decision-making and hinder the development of accurate behavioral models. Recent advances in Large Language Models (LLMs), such as those based on GPT architectures, offer a promising avenue to address this gap. LLMs have demonstrated significant capabilities in language understanding and natural language processing (NLP), making them suitable for automated classification tasks. This study proposes a novel LLM-based domain adaptation (DA) framework using fine-tuning for the automated classification of mechanical assembly parts' functions. By fine-tuning LLMs on domain-specific datasets, the traditionally manual and subjective process of function annotation can be improved in both accuracy and consistency. A case study demonstrates fine-tuning GPT-3.5 Turbo on data from the Oregon State Design Repository (OSDR), and evaluation on the A Big CAD (ABC) dataset shows that the domain-adapted LLM can generate high-quality functional data, enhancing the semantic representation of mechanical parts and supporting more effective design exploration in early-phase engineering.

摘要

概念设计阶段是产品开发过程中关键的早期阶段,设计师在此阶段根据功能需求生成符合预定设计规范的潜在解决方案。功能建模作为该阶段的基础环节,使设计师能够在确定具体结构细节前对产品功能进行推理论证。功能-行为-结构(FBS)框架是广泛采用的功能建模方法,支持将功能意图转化为行为与结构描述。然而,功能化设计的有效性常因缺乏结构良好且全面的功能数据而受限,这种数据匮乏会对早期设计决策产生负面影响,并阻碍精确行为模型的建立。基于GPT架构的大语言模型(LLMs)的最新进展为解决这一问题提供了新途径,其在语言理解与自然语言处理(NLP)方面展现的卓越能力,使其特别适用于自动化分类任务。本研究提出一种基于LLM的领域自适应(DA)新框架,通过微调实现机械装配零件功能的自动分类。在特定领域数据集上对LLM进行微调,可显著提升功能标注这一传统人工主观过程的准确性与一致性。案例研究展示了基于俄勒冈州立大学设计资源库(OSDR)数据对GPT-3.5 Turbo的微调过程,在ABC数据集上的评估表明,经领域自适应的大语言模型能生成高质量功能数据,从而增强机械零件的语义表征能力,为工程早期阶段更有效的设计探索提供支持。


RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation

Abstract

arXiv:2505.01709v1 Announce Type: cross Abstract: Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.

摘要

在开放场景中操作机器人执行多样化任务是机器人技术的重要研究和应用方向。尽管自然语言处理和大规模多模态模型的进展提升了机器人理解复杂指令的能力,但开放环境下的机器人操作仍面临程序性技能困境与陈述性技能困境。现有方法往往需要折中认知与执行能力。针对这些挑战,本文提出RoBridge——一种通用机器人操作的分层智能架构,其由基于大规模预训练视觉语言模型(VLM)的高层认知规划器(HCP)、作为符号桥梁的不变可操作表征(IOR),以及通用具身智能体(GEA)构成。RoBridge既保持了VLM的陈述性技能,又释放了强化学习的程序性技能,有效弥合了认知与执行的鸿沟。实验表明,RoBridge相较现有基线模型取得显著性能提升,在新任务上达到75%成功率,在模拟到现实的泛化中仅需每任务5个真实世界数据样本即实现83%平均成功率。该工作标志着机器人系统认知推理与物理执行融合的重要进展,为通用机器人操作提供了新范式。


Efficient Shapley Value-based Non-Uniform Pruning of Large Language Models

Abstract

arXiv:2505.01731v1 Announce Type: cross Abstract: Pruning large language models (LLMs) is a promising solution for reducing model sizes and computational complexity while preserving performance. Traditional layer-wise pruning methods often adopt a uniform sparsity approach across all layers, which leads to suboptimal performance because the varying significance of individual transformer layers within the model is not accounted for. To this end, we propose the Shapley Value-based Non-Uniform Pruning method for LLMs. This approach quantifies the contribution of each transformer layer to the overall model performance, enabling the assignment of tailored pruning budgets to different layers to retain critical parameters. To further improve efficiency, we design the Sliding Window-based Shapley Value approximation method. It substantially reduces computational overhead compared to exact SV calculation methods. Extensive experiments on various LLMs including LLaMA-v1, LLaMA-v2 and OPT demonstrate the effectiveness of the proposed approach. The results reveal that non-uniform pruning significantly enhances the performance of pruned models. Notably, the proposed method achieves a reduction in perplexity (PPL) of 18.01% and 19.55% on LLaMA-7B and LLaMA-13B, respectively, compared to SparseGPT at 70% sparsity.

摘要

大语言模型(LLM)剪枝是一种在保持性能的同时减小模型规模和计算复杂度的有效方法。传统逐层剪枝方法通常对所有层采用统一的稀疏度策略,由于未考虑模型中各Transformer层的重要性差异,往往导致次优性能。为此,我们提出基于沙普利值的非均匀剪枝方法。该方法量化每个Transformer层对整体模型性能的贡献度,从而为不同层分配定制化的剪枝预算以保留关键参数。为提升效率,我们进一步设计了基于滑动窗口的沙普利值近似计算方法,相比精确计算显著降低了计算开销。在LLaMA-v1、LLaMA-v2和OPT等多种大语言模型上的实验表明,该方法能有效提升剪枝后模型的性能。值得注意的是,在70%稀疏度下,相比SparseGPT方法,该方法使LLaMA-7B和LLaMA-13B的困惑度(PPL)分别降低了18.01%和19.55%。
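
The idea of scoring layers by Shapley value and then giving important layers a smaller pruning budget can be sketched on a toy model. This sketch computes exact Shapley values by enumerating all layer orderings, which is only feasible for a handful of layers; it does not implement the paper's sliding-window approximation. The value function, the allocation rule, and all names are hypothetical.

```python
from itertools import permutations
from typing import Callable, FrozenSet, List


def exact_shapley(n_layers: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Average marginal contribution of each layer over all orderings (exact, O(n!))."""
    totals = [0.0] * n_layers
    orders = list(permutations(range(n_layers)))
    for order in orders:
        kept: FrozenSet[int] = frozenset()
        prev = value(kept)
        for layer in order:
            kept = kept | {layer}
            cur = value(kept)
            totals[layer] += cur - prev
            prev = cur
    return [t / len(orders) for t in totals]


def allocate_sparsity(shapley: List[float], target: float, spread: float = 0.1) -> List[float]:
    """Assumed rule: shift each layer's sparsity around `target`, pruning important layers less."""
    mean = sum(shapley) / len(shapley)
    max_dev = max(abs(s - mean) for s in shapley) or 1.0
    # Clipping to [0, 1] can move the average away from `target` in extreme cases.
    return [min(1.0, max(0.0, target + spread * (mean - s) / max_dev)) for s in shapley]


# Toy value function: each kept layer adds a fixed gain (purely illustrative).
LAYER_GAIN = [3.0, 1.0, 2.0]


def toy_value(kept: FrozenSet[int]) -> float:
    return sum(LAYER_GAIN[i] for i in kept)


phi = exact_shapley(3, toy_value)
budgets = allocate_sparsity(phi, target=0.7, spread=0.1)
```

For an additive value function the Shapley values equal the per-layer gains, so the most important layer (layer 0) ends up with the lowest sparsity, and the average sparsity stays at the 70% target.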


An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding

Abstract

arXiv:2505.01743v1 Announce Type: cross Abstract: The rapid advancements in Large Vision Language Models (LVLMs) offer the potential to surpass conventional labeling by generating richer, more detailed descriptions of on-device human behavior understanding (HBU) in low-resolution vision systems, such as depth, thermal, and infrared. However, existing LVLM approaches are unable to understand low-resolution data well as they are primarily designed for high-resolution data, such as RGB images. A quick fix is to caption a large amount of low-resolution data, but this requires a significant amount of labor-intensive annotation effort. In this paper, we propose a novel, labor-saving system, Llambda, designed to support low-resolution HBU. The core idea is to leverage limited labeled data and a large amount of unlabeled data to guide LLMs in generating informative captions, which can be combined with raw data to effectively fine-tune LVLM models for understanding low-resolution videos in HBU. First, we propose a Contrastive-Oriented Data Labeler, which can capture behavior-relevant information from long, low-resolution videos and generate high-quality pseudo labels for unlabeled data via contrastive learning. Second, we propose a Physical-Knowledge Guided Captioner, which utilizes spatial and temporal consistency checks to mitigate errors in pseudo labels. It can therefore improve LLMs' understanding of sequential data and generate high-quality video captions. Finally, to ensure on-device deployability, we employ LoRA-based efficient fine-tuning to adapt LVLMs for low-resolution data. We evaluate Llambda using a region-scale real-world testbed and three distinct low-resolution datasets, and the experiments show that Llambda outperforms several state-of-the-art LVLM systems by up to 40.03% on average BERTScore.

摘要

大型视觉语言模型(LVLM)的快速发展为超越传统标注方法提供了可能,能够为低分辨率视觉系统(如深度、热成像和红外)中的设备端人类行为理解(HBU)生成更丰富、更细致的描述。然而,现有的大型视觉语言模型主要针对高分辨率数据(如RGB图像)设计,难以有效理解低分辨率数据。一种快速解决方案是对大量低分辨率数据进行标注,但这需要耗费大量人力密集型标注工作。本文提出了一种新型省力系统Llambda,旨在支持低分辨率HBU。其核心思想是利用有限标注数据和大量未标注数据引导大语言模型(LLM)生成信息丰富的描述文本,这些文本可与原始数据结合,有效微调LVLM模型以理解HBU中的低分辨率视频。首先,我们提出对比导向数据标注器,通过对比学习从长时低分辨率视频中捕获行为相关信息,并为未标注数据生成高质量伪标签。其次,我们提出物理知识引导的标注生成器,利用时空一致性检查来减少伪标签错误,从而提升LLM对序列数据的理解能力以生成高质量视频描述。最后,为确保设备端可部署性,我们采用基于LoRA的高效微调方法使LVLM适配低分辨率数据。通过在区域级真实测试平台和三个不同低分辨率数据集上的评估,实验表明Llambda在平均Bert-Score上最高优于现有最优LVLM系统达40.03%。


New News: System-2 Fine-tuning for Robust Integration of New Knowledge

Abstract

arXiv:2505.01812v1 Announce Type: cross Abstract: Humans and intelligent animals can effortlessly internalize new information ("news") and accurately extract the implications for performing downstream tasks. While large language models (LLMs) can achieve this through in-context learning (ICL) when the news is explicitly given as context, fine-tuning remains challenging for the models to consolidate learning in weights. In this paper, we introduce New News, a dataset composed of hypothetical yet plausible news spanning multiple domains (mathematics, coding, discoveries, leaderboards, events), accompanied by downstream evaluation questions whose correct answers critically depend on understanding and internalizing the news. We first demonstrate a substantial gap between naive fine-tuning and in-context learning (FT-ICL gap) on our news dataset. To address this gap, we explore a suite of self-play data generation protocols -- paraphrases, implications and Self-QAs -- designed to distill the knowledge from the model with context into the weights of the model without the context, which we term System-2 Fine-tuning (Sys2-FT). We systematically evaluate ICL and Sys2-FT performance across data domains and model scales with the Qwen 2.5 family of models. Our results demonstrate that the self-QA protocol of Sys2-FT significantly improves models' in-weight learning of the news. Furthermore, we discover the contextual shadowing effect, where training with the news in context followed by its rephrases or QAs degrades learning of the news. Finally, we show preliminary evidence of an emerging scaling law of Sys2-FT.

摘要

人类与智能动物能够轻松内化新信息("新闻"),并准确提取其对执行下游任务的隐含影响。虽然大型语言模型(LLM)在新闻被明确作为上下文给出时,可以通过上下文学习(ICL)实现这一目标,但微调方法仍难以将学习成果巩固到模型权重中。本文提出《新新闻》数据集,该数据集包含跨多个领域(数学、编程、科学发现、排行榜、事件)的假设性但合理的新闻,并附有下游评估问题——这些问题的正确答案关键取决于对新闻的理解与内化。我们首先证明了在新闻数据集上,朴素微调与上下文学习之间存在显著差距(FT-ICL差距)。为弥补这一差距,我们探索了一套自博弈数据生成协议——包括转述、推衍和自问自答——旨在将模型在上下文中的知识蒸馏到无上下文情况下的模型权重中,该方法被我们称为"系统2微调"(Sys2-FT)。我们使用Qwen 2.5系列模型,系统评估了不同数据领域和模型规模下ICL与Sys2-FT的性能。结果表明,Sys2-FT的自问自答协议显著提升了模型对新闻的权重内学习能力。此外,我们发现了"语境遮蔽效应":当模型在上下文中学习新闻后,再接受其转述或问答训练时,会削弱对原始新闻的学习效果。最后,我们提供了Sys2-FT涌现出的规模定律的初步证据。


Intra-Layer Recurrence in Transformers for Language Modeling

Abstract

arXiv:2505.01855v1 Announce Type: cross Abstract: Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by reprocessing layers multiple times, they often apply recurrence indiscriminately across entire blocks of layers. In this work, we investigate Intra-Layer Recurrence (ILR), a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results. These findings suggest that ILR offers a promising direction for optimizing recurrent structures in transformer architectures.

摘要

Transformer模型在自然语言处理领域确立了新的性能基准,但其不断增加的深度导致参数量大幅增长。现有循环Transformer方法通过多次重处理层来解决这一问题,但往往不加区分地对整个层块应用循环机制。本研究探讨层内循环(ILR)这一更具针对性的方法,该技术在单次前向传播中选择性地对单个层应用循环处理。实验表明,将更多迭代次数分配给早期层能获得最优结果。这些发现表明,ILR为优化Transformer架构中的循环结构提供了有前景的研究方向。
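
The idea of applying recurrence to individual layers can be illustrated with a small sketch. It assumes a "reuse map" that says how many times each layer is applied in a row within one forward pass; the names `forward_with_ilr` and `make_toy_layer`, the affine toy layers, and the map `[2, 1, 1]` are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward_with_ilr(x: Vector, layers: Sequence[Layer], reuse_map: Sequence[int]) -> Vector:
    """Apply layer i reuse_map[i] times in a row before moving to layer i + 1."""
    if len(layers) != len(reuse_map):
        raise ValueError("reuse_map needs one entry per layer")
    for layer, repeats in zip(layers, reuse_map):
        for _ in range(repeats):
            x = layer(x)  # the same weights are reused, so no parameters are added
    return x


def make_toy_layer(scale: float, shift: float) -> Layer:
    """Stand-in for a transformer layer: an elementwise affine map (assumption)."""
    return lambda x: [scale * v + shift for v in x]


toy_layers = [make_toy_layer(0.5, 1.0), make_toy_layer(2.0, 0.0), make_toy_layer(1.0, -1.0)]
baseline = forward_with_ilr([4.0], toy_layers, [1, 1, 1])     # every layer once
early_heavy = forward_with_ilr([4.0], toy_layers, [2, 1, 1])  # extra pass on the first layer
```

In this sketch the extra iteration costs compute but no parameters, which is the trade-off the abstract describes; which layers benefit most is an empirical question that the toy layers cannot answer.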


PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications

Abstract

arXiv:2505.01881v1 Announce Type: cross Abstract: Robust navigation in diverse environments and domains requires both accurate state estimation and transparent decision making. We present PhysNav-DG, a novel framework that integrates classical sensor fusion with the semantic power of vision-language models. Our dual-branch architecture predicts navigation actions from multi-sensor inputs while simultaneously generating detailed chain-of-thought explanations. A modified Adaptive Kalman Filter dynamically adjusts its noise parameters based on environmental context. It leverages several streams of raw sensor data along with semantic insights from models such as LLaMA 3.2 11B and BLIP-2. To evaluate our approach, we introduce the MD-NEX Benchmark, a novel multi-domain dataset that unifies indoor navigation, autonomous driving, and social navigation tasks with ground-truth actions and human-validated explanations. Extensive experiments and ablations show that PhysNav-DG improves navigation success rates by over 20% and achieves high efficiency, with explanations that are both highly grounded and clear. This work connects high-level semantic reasoning and geometric planning for safer and more trustworthy autonomous systems.

摘要

在多环境和多领域中实现鲁棒导航需要精确的状态估计和透明的决策过程。我们提出PhysNav-DG框架,该创新性方案将经典传感器融合与视觉语言模型的语义能力相结合。我们的双分支架构既能通过多传感器输入预测导航动作,又可同步生成详细的思维链解释。改进的自适应卡尔曼滤波器能根据环境上下文动态调整噪声参数,该框架整合了多种原始传感器数据流以及来自LLaMA 3.2 11B和BLIP-2等模型的语义洞察。为评估方法性能,我们构建了MD-NEX基准测试——这是一个新颖的多领域数据集,统一了包含真实动作标注和人工验证解释的室内导航、自动驾驶及社交导航任务。大量实验与消融研究表明,PhysNav-DG将导航成功率提升超过20%,在保持高效运行的同时,其生成的解释兼具高度可靠性与清晰性。本研究通过连接高层语义推理与几何规划,为构建更安全、更可信的自主系统提供了新途径。
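
A Kalman filter that adjusts its noise parameters from context can be sketched in one dimension. The sketch assumes a scalar, static state model and a lookup table from a scene label to a measurement-noise variance; the labels, the values in `CONTEXT_TO_R`, and both function names are hypothetical and are not taken from PhysNav-DG.

```python
from typing import List, Sequence, Tuple

# Hypothetical mapping from a scene label (for example, one produced by a
# vision-language model) to measurement-noise variance. Values are made up.
CONTEXT_TO_R = {"clear": 0.1, "cluttered": 1.0}


def kalman_step(x: float, p: float, z: float, q: float, r: float) -> Tuple[float, float]:
    """One predict/update cycle for a scalar state that is assumed static."""
    p_pred = p + q                    # predict: uncertainty grows by process noise q
    gain = p_pred / (p_pred + r)      # Kalman gain: smaller when the sensor is noisier
    x_new = x + gain * (z - x)        # update the estimate toward the measurement z
    p_new = (1.0 - gain) * p_pred     # updated estimate variance
    return x_new, p_new


def adaptive_filter(measurements: Sequence[float], contexts: Sequence[str],
                    x0: float = 0.0, p0: float = 1.0, q: float = 0.01) -> List[float]:
    """Run the filter, choosing the measurement noise r from each step's context."""
    x, p = x0, p0
    estimates = []
    for z, ctx in zip(measurements, contexts):
        x, p = kalman_step(x, p, z, q, CONTEXT_TO_R[ctx])
        estimates.append(x)
    return estimates
```

With a larger `r`, the gain shrinks and the estimate trusts the measurement less, which is the intended effect of raising the noise parameter in a cluttered scene.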


LookAlike: Consistent Distractor Generation in Math MCQs

Abstract

arXiv:2505.01903v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used to generate distractors for multiple-choice questions (MCQs), especially in domains like math education. However, existing approaches are limited in ensuring that the generated distractors are consistent with common student errors. We propose LookAlike, a method that improves error-distractor consistency via preference optimization. Our two main innovations are: (a) mining synthetic preference pairs from model inconsistencies, and (b) alternating supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to stabilize training. Unlike prior work that relies on heuristics or manually annotated preference data, LookAlike uses its own generation inconsistencies as dispreferred samples, thus enabling scalable and stable training. Evaluated on a real-world dataset of 1,400+ math MCQs, LookAlike achieves 51.6% accuracy in distractor generation and 57.2% in error generation under LLM-as-a-judge evaluation, outperforming an existing state-of-the-art method (45.6% / 47.7%). These improvements highlight the effectiveness of preference-based regularization and inconsistency mining for generating consistent math MCQ distractors at scale.

摘要

大型语言模型(LLMs)正被越来越多地用于为选择题(MCQs)生成干扰项,尤其在数学教育等领域。然而,现有方法难以确保生成的干扰项与学生的常见错误保持一致。我们提出LookAlike方法,通过偏好优化提升错误-干扰项一致性。其主要创新点在于:(a)从模型不一致性中挖掘合成偏好对;(b)交替使用监督微调(SFT)和直接偏好优化(DPO)以稳定训练。与依赖启发式规则或人工标注偏好数据的现有工作不同,LookAlike利用自身生成的不一致性作为非偏好样本,从而实现可扩展且稳定的训练。在包含1400多道数学选择题的真实数据集上评估时,LookAlike在LLM作为评判者的测试中分别达到51.6%的干扰项生成准确率和57.2%的错误生成准确率,优于现有最优方法(45.6%/47.7%)。这些改进凸显了基于偏好的正则化与不一致性挖掘对于大规模生成一致性数学选择题干扰项的有效性。
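
For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO loss can be written in a few lines. This is a sketch of the generic DPO objective only; how LookAlike mines its preference pairs and alternates with SFT is not shown, and the function name and the default `beta` are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals softplus(-m); both branches avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference model agree on both responses, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response relative to the rejected one.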


Semantic Intelligence: Integrating GPT-4 with A* Planning in Low-Cost Robotics

Abstract

arXiv:2505.01931v1 Announce Type: cross Abstract: Classical robot navigation often relies on hardcoded state machines and purely geometric path planners, limiting a robot's ability to interpret high-level semantic instructions. In this paper, we first assess GPT-4's ability to act as a path planner compared to the A* algorithm, then present a hybrid planning framework that integrates GPT-4's semantic reasoning with A* on a low-cost robot platform operating on ROS2 Humble. Our approach eliminates explicit finite state machine (FSM) coding by using prompt-based GPT-4 reasoning to handle task logic while maintaining the accurate paths computed by A*. The GPT-4 module provides semantic understanding of instructions and environmental cues (e.g., recognizing toxic obstacles or crowded areas to avoid, or understanding low-battery situations requiring alternate route selection), and dynamically adjusts the robot's occupancy grid via obstacle buffering to enforce semantic constraints. We demonstrate multi-step reasoning for sequential tasks, such as first navigating to a resource goal and then reaching a final destination safely. Experiments on a Petoi Bittle robot with an overhead camera and Raspberry Pi Zero 2W compare classical A* against GPT-4-assisted planning. Results show that while A* is faster and more accurate for basic route generation and obstacle avoidance, the GPT-4-integrated system achieves high success rates (96-100%) on semantic tasks that are infeasible for pure geometric planners. This work highlights how affordable robots can exhibit intelligent, context-aware behaviors by leveraging large language model reasoning with minimal hardware and no fine-tuning.

摘要

传统机器人导航通常依赖于硬编码状态机和纯几何路径规划器,限制了机器人理解高层语义指令的能力。本文首先评估GPT-4作为路径规划器与A*算法的性能差异,随后提出一种混合规划框架,该框架在基于ROS2 Humble的低成本机器人平台上将GPT-4的语义推理能力与A*算法相结合。我们的方法通过基于提示的GPT-4推理处理任务逻辑,同时保留A*算法计算的精确路径,从而消除了显式有限状态机(FSM)编码需求。GPT-4模块提供对指令和环境线索的语义理解(例如识别需避开的毒性障碍物或拥挤区域,或理解需要选择替代路线的低电量情况),并通过障碍物缓冲动态调整机器人占据栅格以强化语义约束。我们展示了多步骤顺序任务的推理能力,例如先导航至资源目标再安全抵达最终目的地。在配备顶置摄像头和树莓派Zero 2W的Petoi Bittle机器人上进行的实验对比了传统A*与GPT-4辅助规划方案。结果表明:虽然A*在基本路径生成和避障方面速度更快、精度更高,但集成GPT-4的系统在纯几何规划器无法实现的语义任务上取得了96-100%的高成功率。本研究证明,通过结合大语言模型推理能力,低成本机器人在硬件极简且无需微调的条件下即可展现出智能化的情境感知行为。
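
The division of labour described above, where a semantic module edits the occupancy grid and A* plans on the result, can be sketched as follows. The sketch assumes a 4-connected grid, unit step costs, a Manhattan heuristic, and a square buffer around flagged cells; the function names and the example grid are hypothetical, and the GPT-4 step is replaced by a hard-coded list of cells to avoid.

```python
import heapq
from typing import Dict, List, Optional, Sequence, Tuple

Cell = Tuple[int, int]


def inflate(grid: Sequence[Sequence[int]], cells: Sequence[Cell], radius: int) -> List[List[int]]:
    """Return a copy of grid with all cells within `radius` (Chebyshev) of `cells` blocked."""
    rows, cols = len(grid), len(grid[0])
    out = [list(row) for row in grid]
    for r, c in cells:
        for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
            for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                out[rr][cc] = 1
    return out


def astar(grid: Sequence[Sequence[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """A* on a 4-connected grid; 1 marks a blocked cell. Returns the path or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell: Cell) -> int:  # Manhattan distance, admissible for unit steps
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from: Dict[Cell, Cell] = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or grid[nxt[0]][nxt[1]] == 1:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                came_from[nxt] = cell
                heapq.heappush(open_heap, (new_cost + h(nxt), new_cost, nxt))
    return None


free_grid = [[0] * 5 for _ in range(5)]
hazards = [(2, 2)]  # stand-in for cells a language model flagged as "avoid"
plain_path = astar(free_grid, (2, 0), (2, 4))
buffered_path = astar(inflate(free_grid, hazards, radius=1), (2, 0), (2, 4))
```

The planner itself never sees the semantics: the flagged cell is geometrically free, so plain A* drives straight through it, while the buffered grid forces a detour.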


Analyzing Cognitive Differences Among Large Language Models through the Lens of Social Worldview

Abstract

arXiv:2505.01967v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become integral to daily life, widely adopted in communication, decision-making, and information retrieval, raising critical questions about how these systems implicitly form and express socio-cognitive attitudes or "worldviews". While existing research extensively addresses demographic and ethical biases, broader dimensions, such as attitudes toward authority, equality, autonomy, and fate, remain under-explored. In this paper, we introduce the Social Worldview Taxonomy (SWT), a structured framework grounded in Cultural Theory, operationalizing four canonical worldviews (Hierarchy, Egalitarianism, Individualism, Fatalism) into measurable sub-dimensions. Using SWT, we empirically identify distinct and interpretable cognitive profiles across 28 diverse LLMs. Further, inspired by Social Referencing Theory, we experimentally demonstrate that explicit social cues systematically shape these cognitive attitudes, revealing both general response patterns and nuanced model-specific variations. Our findings enhance the interpretability of LLMs by revealing implicit socio-cognitive biases and their responsiveness to social feedback, thus guiding the development of more transparent and socially responsible language technologies.

摘要

大型语言模型(LLMs)已深度融入日常生活,广泛应用于沟通交流、决策制定和信息检索领域,这引发了关于这些系统如何隐式形成并表达社会认知态度或"世界观"的关键问题。尽管现有研究广泛探讨了人口统计和伦理偏见,但更广泛的维度——如对权威、平等、自主性和命运的态度——仍未得到充分探索。本文基于文化理论提出"社会世界观分类法"(SWT),将四种典型世界观(等级主义、平等主义、个人主义、宿命论)操作化为可测量的子维度。通过SWT框架,我们在28个多样化LLMs中实证识别出具有区分度且可解释的认知特征。进一步受社会参照理论启发,实验证明显性社会线索能系统性塑造这些认知态度,既揭示了普遍响应模式,也呈现出细微的模型特异性差异。本研究通过揭示LLMs隐含的社会认知偏见及其对社会反馈的响应机制,增强了模型可解释性,为开发更透明且符合社会责任的语言技术提供了指导。


Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach

Abstract

arXiv:2505.01997v1 Announce Type: cross Abstract: One of the key technologies for the success of Large Language Models (LLMs) is preference alignment. However, a notable side effect of preference alignment is poor calibration: while the pre-trained models are typically well-calibrated, LLMs tend to become poorly calibrated after alignment with human preferences. In this paper, we investigate why preference alignment affects calibration and how to address this issue. For the first question, we observe that the preference collapse issue in alignment undesirably generalizes to the calibration scenario, causing LLMs to exhibit overconfidence and poor calibration. To address this, we demonstrate the importance of fine-tuning with domain-specific knowledge to alleviate the overconfidence issue. To further analyze whether this affects the model's performance, we categorize models into two regimes: calibratable and non-calibratable, defined by bounds of Expected Calibration Error (ECE). In the calibratable regime, we propose a calibration-aware fine-tuning approach to achieve proper calibration without compromising LLMs' performance. However, as models are further fine-tuned for better performance, they enter the non-calibratable regime. For this case, we develop an EM-algorithm-based ECE regularization for the fine-tuning loss to maintain low calibration error. Extensive experiments validate the effectiveness of the proposed methods.

摘要

大型语言模型(LLMs)成功的关键技术之一是偏好对齐。然而,偏好对齐的一个显著副作用是校准效果变差:尽管预训练模型通常校准良好,但在与人类偏好对齐后,LLMs往往变得校准不佳。本文研究了偏好对齐为何影响校准以及如何解决这一问题。针对第一个问题,我们观察到对齐过程中的偏好崩溃问题会不适当地泛化到校准场景,导致LLMs表现出过度自信和校准不良。为解决这一问题,我们证明了利用领域特定知识进行微调对缓解过度自信问题的重要性。为了进一步分析这是否影响模型性能,我们将模型分为两类:可校准和不可校准,其定义基于期望校准误差(ECE)的界限。在可校准范围内,我们提出了一种校准感知的微调方法,以在不影响LLMs性能的情况下实现适当的校准。然而,随着模型进一步微调以获得更好的性能,它们会进入不可校准范围。针对这种情况,我们开发了一种基于EM算法的ECE正则化方法,用于微调损失函数以保持低校准误差。大量实验验证了所提方法的有效性。
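
Expected Calibration Error, the quantity the abstract uses to define its two regimes, is commonly estimated by binning predictions by confidence. The sketch below shows that standard equal-width binned estimator; the bin count of 10 and the function name are assumptions, and the paper's calibration-aware fine-tuning and EM-based regularizer are not reproduced here.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Binned ECE: sum over bins of (bin share) * |bin accuracy - bin mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0.0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_sums[idx] += ok
    total = len(confidences)
    ece = 0.0
    for count, conf_sum, hit_sum in zip(counts, conf_sums, hit_sums):
        if count:
            ece += (count / total) * abs(hit_sum / count - conf_sum / count)
    return ece
```

An overconfident model, for example one that reports 0.9 confidence while being right 75% of the time, has a gap of about 0.15 in that bin, which is the kind of miscalibration the abstract attributes to preference alignment.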


Testing Database Systems with Large Language Model Synthesized Fragments

Abstract

arXiv:2505.02012v1 Announce Type: cross Abstract: Various automated testing approaches have been proposed for Database Management Systems (DBMSs). Many such approaches generate pairs of equivalent queries to identify bugs that cause DBMSs to compute incorrect results, and have found hundreds of bugs in mature, widely used DBMSs. Most of these approaches are based on manually written SQL generators; however, their bug-finding capabilities remain constrained by the limited set of SQL features supported by the generators. In this work, we propose ShQveL, an approach that augments existing SQL test-case generators by leveraging Large Language Models (LLMs) to synthesize SQL fragments. Our key idea is to systematically incorporate SQL features gained through automated interactions with LLMs into the SQL generators, increasing the features covered while efficiently generating test cases. Specifically, ShQveL uses SQL sketches -- SQL statements with incomplete code segments that LLMs fill -- to integrate LLM-generated content into the generator. We evaluated ShQveL on 5 DBMSs and discovered 55 unique and previously unknown bugs, 50 of which were promptly fixed after our reports.

摘要

针对数据库管理系统(DBMS),已有多种自动化测试方法被提出。其中许多方法通过生成等价查询对来识别导致DBMS计算结果错误的缺陷,并在成熟且广泛使用的DBMS中发现了数百个错误。这些方法大多基于手动编写的SQL生成器,但其缺陷检测能力仍受限于生成器支持的有限SQL功能集。本研究提出ShQveL方法,通过利用大语言模型(LLM)合成SQL片段来增强现有SQL测试用例生成器。其核心思想是通过与LLM的自动化交互,系统性地将获取的SQL功能整合到SQL生成器中,从而在高效生成测试用例的同时扩大功能覆盖范围。具体而言,ShQveL采用SQL草图(包含由LLM填充的不完整代码段的SQL语句)将LLM生成内容集成至生成器。我们在5个DBMS上评估ShQveL,发现了55个独特且未知的缺陷,其中50个在报告后得到及时修复。


Wide & Deep Learning for Node Classification

Abstract

arXiv:2505.02020v1 Announce Type: cross Abstract: Wide & Deep, a simple yet effective learning architecture for recommendation systems developed by Google, has had a significant impact in both academia and industry due to its combination of the memorization ability of generalized linear models and the generalization ability of deep models. Graph convolutional networks (GCNs) remain dominant in node classification tasks; however, recent studies have highlighted issues such as heterophily and expressiveness, which focus on graph structure while seemingly neglecting the potential role of node features. In this paper, we propose a flexible framework GCNIII, which leverages the Wide & Deep architecture and incorporates three techniques: Intersect memory, Initial residual and Identity mapping. We provide comprehensive empirical evidence showing that GCNIII can more effectively balance the trade-off between over-fitting and over-generalization on various semi-supervised and fully supervised tasks. Additionally, we explore the use of large language models (LLMs) for node feature engineering to enhance the performance of GCNIII in cross-domain node classification tasks. Our implementation is available at https://github.com/CYCUCAS/GCNIII.

摘要

由谷歌开发的推荐系统学习架构Wide & Deep,通过结合广义线性模型的记忆能力与深度模型的泛化能力,在学术界和工业界产生了重大影响。图卷积网络(GCNs)在节点分类任务中仍占据主导地位,但近期研究揭示了诸如异质性和表达能力等问题,这些问题关注图结构的同时似乎忽视了节点特征的潜在作用。本文提出了一种灵活框架GCNIII,该框架利用Wide & Deep架构,并融合了三种技术:交集记忆(Intersect memory)、初始残差(Initial residual)和恒等映射(Identity mapping)。我们提供了全面的实证证据,表明GCNIII在各种半监督和全监督任务中能更有效地平衡过拟合与过泛化之间的权衡。此外,我们探索了使用大语言模型(LLMs)进行节点特征工程,以提升GCNIII在跨领域节点分类任务中的性能。实现代码详见https://github.com/CYCUCAS/GCNIII。


What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction

Abstract

arXiv:2505.02072v1 Announce Type: cross Abstract: The notion of language modeling has gradually shifted in recent years from a distribution over finite-length strings to general-purpose prediction models for textual inputs and outputs, following appropriate alignment phases. This paper analyzes the distinction between distribution estimation and response prediction in the context of LLMs, and their often conflicting goals. We examine the training phases of LLMs, which include pretraining, in-context learning, and preference tuning, and also the common use cases for their output probabilities, which include completion probabilities and explicit probabilities as output. We argue that the different settings lead to three distinct intended output distributions. We demonstrate that NLP works often assume that these distributions should be similar, which leads to misinterpretations of their experimental findings. Our work sets firmer formal foundations for the interpretation of LLMs, which will inform ongoing work on the interpretation and use of LLMs' induced distributions.

摘要

近年来,语言建模的概念逐渐从有限长度字符串的概率分布演变为针对文本输入输出的通用预测模型(需经过适当的对齐阶段)。本文分析了大型语言模型(LLMs)中分布估计与响应预测的区别及其常存的目标冲突。我们考察了LLMs的训练阶段(包括预训练、上下文学习和偏好微调)及其输出概率的常见应用场景(包括补全概率和显式输出概率),论证不同设定会导致三种不同的预期输出分布。研究表明,自然语言处理领域常默认这些分布应具有相似性,从而导致对实验结果的误读。本研究为LLMs的分布解释奠定了更坚实的理论基础,将为LLMs诱导分布的解读与应用研究提供重要参考。


DriveAgent: Multi-Agent Structured Reasoning with LLM and Multimodal Sensor Fusion for Autonomous Driving

Abstract

arXiv:2505.02123v1 Announce Type: cross Abstract: We introduce DriveAgent, a novel multi-agent autonomous driving framework that leverages large language model (LLM) reasoning combined with multimodal sensor fusion to enhance situational understanding and decision-making. DriveAgent uniquely integrates diverse sensor modalities (camera, LiDAR, GPS, and IMU) with LLM-driven analytical processes structured across specialized agents. The framework operates through a modular agent-based pipeline comprising four principal modules: (i) a descriptive analysis agent identifying critical sensor data events based on filtered timestamps, (ii) dedicated vehicle-level analysis conducted by LiDAR and vision agents that collaboratively assess vehicle conditions and movements, (iii) environmental reasoning and causal analysis agents explaining contextual changes and their underlying mechanisms, and (iv) an urgency-aware decision-generation agent prioritizing insights and proposing timely maneuvers. This modular design empowers the LLM to effectively coordinate specialized perception and reasoning agents, delivering cohesive, interpretable insights into complex autonomous driving scenarios. Extensive experiments on challenging autonomous driving datasets demonstrate that DriveAgent achieves superior performance on multiple metrics against baseline methods. These results validate the efficacy of the proposed LLM-driven multi-agent sensor fusion framework, underscoring its potential to substantially enhance the robustness and reliability of autonomous driving systems.

摘要

我们提出DriveAgent,一种创新的多智能体自动驾驶框架,通过结合大型语言模型(LLM)推理与多模态传感器融合技术,显著提升环境理解与决策能力。该框架创新性地将摄像头、激光雷达、GPS和惯性测量单元(IMU)等异构传感器数据,与基于LLM的分布式分析流程相整合。系统采用模块化智能体架构,包含四个核心功能模块:(1)描述性分析智能体基于时间戳过滤识别关键传感器事件;(2)激光雷达与视觉智能体协同执行车辆级分析,评估周边车辆状态与运动轨迹;(3)环境推理与因果分析智能体解析场景变化及其内在机理;(4)具备紧急程度感知的决策生成智能体,负责优先级判定并及时输出操控建议。这种模块化设计使LLM能高效协调各专业感知推理智能体,为复杂自动驾驶场景提供可解释的连贯分析。在多个高难度自动驾驶数据集上的实验表明,DriveAgent在多项指标上显著超越基准方法。这些结果验证了所提出的LLM驱动多智能体传感器融合框架的有效性,凸显其对于提升自动驾驶系统鲁棒性与可靠性的重要价值。


A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking

Abstract

arXiv:2505.02171v1 Announce Type: cross Abstract: Document chunking fundamentally impacts Retrieval-Augmented Generation (RAG) by determining how source materials are segmented before indexing. Despite evidence that Large Language Models (LLMs) are sensitive to the layout and structure of retrieved data, there is currently no framework to analyze the impact of different chunking methods. In this paper, we introduce a novel methodology that defines essential characteristics of the chunking process at three levels: intrinsic passage properties, extrinsic passage properties, and passages-document coherence. We propose HOPE (Holistic Passage Evaluation), a domain-agnostic, automatic evaluation metric that quantifies and aggregates these characteristics. Our empirical evaluations across seven domains demonstrate that the HOPE metric correlates significantly (p > 0.13) with various RAG performance indicators, revealing contrasts between the importance of extrinsic and intrinsic properties of passages. Semantic independence between passages proves essential for system performance with a performance gain of up to 56.2% in factual correctness and 21.1% in answer correctness. On the contrary, traditional assumptions about maintaining concept unity within passages show minimal impact. These findings provide actionable insights for optimizing chunking strategies, thus improving RAG system design to produce more factually correct responses.

摘要

文档分块技术通过决定源材料在索引前的分割方式,从根本上影响着检索增强生成(RAG)系统的性能。尽管有证据表明大语言模型(LLMs)对检索数据的布局和结构具有敏感性,但目前缺乏分析不同分块方法影响的框架。本文提出一种创新方法论,从三个层面定义分块过程的核心特征:段落内在属性、段落外在属性以及段落-文档连贯性。我们开发了HOPE(整体段落评估)这一领域无关的自动评估指标,用于量化并整合这些特征。在七个领域的实证评估表明,HOPE指标与多种RAG性能指标呈现显著相关性(p > 0.13),揭示了段落外在属性与内在属性的重要性差异。实验证明段落间的语义独立性对系统性能至关重要,可使事实准确性提升高达56.2%,答案正确率提高21.1%。相反,传统关于保持段落内概念统一性的假设影响甚微。这些发现为优化分块策略提供了可操作的见解,从而改进RAG系统设计以生成更具事实准确性的响应。


SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation

Abstract

arXiv:2505.02235v1 Announce Type: cross Abstract: Evaluating text summarization quality remains a critical challenge in Natural Language Processing. Current approaches face a trade-off between performance and interpretability. We present SEval-Ex, a framework that bridges this gap by decomposing summarization evaluation into atomic statements, enabling both high performance and explainability. SEval-Ex employs a two-stage pipeline: it first extracts atomic statements from the source text and the summary using an LLM, then matches the two sets of generated statements. Unlike existing approaches that provide only summary-level scores, our method generates detailed evidence for its decisions through statement-level alignments. Experiments on the SummEval benchmark demonstrate that SEval-Ex achieves state-of-the-art performance, with a 0.580 correlation with human consistency judgments, surpassing GPT-4-based evaluators (0.521) while maintaining interpretability. Finally, our framework shows robustness against hallucination.

摘要

文本摘要质量评估仍是自然语言处理领域的关键挑战。现有方法面临性能与可解释性之间的权衡问题。本文提出SEval-Ex框架,通过将摘要评估分解为原子陈述来弥合这一鸿沟,实现高性能与可解释性的统一。该框架采用两阶段流程:首先利用大语言模型从原文和摘要中提取原子陈述,随后进行生成陈述的匹配。与仅提供摘要级评分的现有方法不同,我们的方法通过陈述级对齐为决策生成详细证据。在SummEval基准测试中,SEval-Ex以0.580的人类一致性判断相关性达到最先进性能,超越基于GPT-4的评估器(0.521),同时保持可解释性。最后,本框架对幻觉现象展现出强鲁棒性。


Parameter-Efficient Transformer Embeddings

Abstract

arXiv:2505.02266v1 Announce Type: cross Abstract: Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs using a Fourier expansion of their normalized values, followed by a lightweight multilayer perceptron (MLP) that captures higher-order interactions. We train standard transformers and our architecture on natural language inference tasks (SNLI and MNLI), and evaluate zero-shot performance on sentence textual similarity (STS-B). Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout. This proof-of-concept study highlights the potential for scalable, memory-efficient language models and motivates further large-scale experimentation based on our findings.

摘要

基于Transformer的自然语言处理模型中,嵌入层通常占据模型参数的最大比重,其规模随词汇表大小增长却未能带来相应的性能提升。本文提出一种创新方法:首先通过对归一化标记ID进行傅里叶展开来确定性生成标记嵌入向量,随后通过轻量级多层感知机(MLP)捕获高阶交互。我们在自然语言推理任务(SNLI和MNLI)上训练标准Transformer和本架构,并在句子文本相似度任务(STS-B)评估零样本性能。实验结果表明,所提方法以显著更少的参数量达到竞争性性能,训练速度更快,且无需dropout即可有效运行。这项概念验证研究揭示了构建可扩展、内存高效语言模型的潜力,并为基于本发现的大规模实验提供了研究动机。
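
The deterministic part of this scheme can be pictured as a parameter-free feature map from a token ID to a vector. The sketch below assumes normalization by vocabulary size and integer frequencies 1, 2, ..., dim/2 with interleaved sine and cosine terms; those choices and the function name are assumptions, and the lightweight MLP that follows in the paper is omitted.

```python
import math
from typing import List


def fourier_embedding(token_id: int, vocab_size: int, dim: int) -> List[float]:
    """Map a token ID to `dim` Fourier features without any learned parameters."""
    if dim % 2 != 0:
        raise ValueError("dim must be even (one sine and one cosine per frequency)")
    if not 0 <= token_id < vocab_size:
        raise ValueError("token_id out of range")
    x = token_id / vocab_size  # normalized ID in [0, 1)
    features = []
    for k in range(1, dim // 2 + 1):
        angle = 2.0 * math.pi * k * x
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features
```

Because the map is computed on the fly, it stores no per-token table; in a design like the one described, only the small MLP applied afterwards would carry trainable weights.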


Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques

Abstract

arXiv:2505.02309v1 Announce Type: cross Abstract: Large Language Models (LLMs) have revolutionized many areas of artificial intelligence (AI), but their substantial resource requirements limit their deployment on mobile and edge devices. This survey paper provides a comprehensive overview of techniques for compressing LLMs to enable efficient inference in resource-constrained environments. We examine three primary approaches: Knowledge Distillation, Model Quantization, and Model Pruning. For each technique, we discuss the underlying principles, present different variants, and provide examples of successful applications. We also briefly discuss complementary techniques such as mixture-of-experts and early-exit strategies. Finally, we highlight promising future directions, aiming to provide a valuable resource for both researchers and practitioners seeking to optimize LLMs for edge deployment.

摘要

大语言模型(LLMs)已经彻底改变了人工智能(AI)的许多领域,但其巨大的资源需求限制了它们在移动和边缘设备上的部署。本综述论文全面概述了压缩LLMs的技术,以实现在资源受限环境中的高效推理。我们研究了三种主要方法:知识蒸馏、模型量化和模型剪枝。针对每种技术,我们讨论了其基本原理,介绍了不同的变体,并提供了成功应用的示例。我们还简要讨论了混合专家和早期退出策略等补充技术。最后,我们强调了未来有前景的研究方向,旨在为研究人员和实践者提供一个有价值的资源,帮助他们优化LLMs以实现边缘部署。


Advancing Email Spam Detection: Leveraging Zero-Shot Learning and Large Language Models

Abstract

arXiv:2505.02362v1 Announce Type: cross Abstract: Email spam detection is a critical task in modern communication systems, essential for maintaining productivity, security, and user experience. Traditional machine learning and deep learning approaches, while effective in static settings, face significant limitations in adapting to evolving spam tactics, addressing class imbalance, and managing data scarcity. These challenges necessitate innovative approaches that reduce dependency on extensive labeled datasets and frequent retraining. This study investigates the effectiveness of Zero-Shot Learning using FLAN-T5, combined with advanced Natural Language Processing (NLP) techniques such as BERT for email spam detection. By employing BERT to preprocess and extract critical information from email content, and FLAN-T5 to classify emails in a Zero-Shot framework, the proposed approach aims to address the limitations of traditional spam detection systems. The integration of FLAN-T5 and BERT enables robust spam detection without relying on extensive labeled datasets or frequent retraining, making it highly adaptable to unseen spam patterns and adversarial environments. This research highlights the potential of leveraging zero-shot learning and NLP techniques for scalable and efficient spam detection, providing insights into their capability to address the dynamic and challenging nature of spam detection tasks.

摘要

电子邮件垃圾邮件检测是现代通信系统中的关键任务,对维护生产力、安全性和用户体验至关重要。传统机器学习和深度学习方法虽然在静态环境中有效,但在适应不断演变的垃圾邮件策略、解决类别不平衡问题以及处理数据稀缺性方面存在显著局限性。这些挑战要求采用创新方法,以减少对大量标注数据集和频繁重新训练的依赖。本研究探讨了使用FLAN-T5的零样本学习结合先进自然语言处理(NLP)技术(如BERT)在电子邮件垃圾邮件检测中的有效性。通过利用BERT预处理和提取邮件内容中的关键信息,并采用FLAN-T5在零样本框架下对邮件进行分类,所提出的方法旨在解决传统垃圾邮件检测系统的局限性。FLAN-T5与BERT的结合实现了无需依赖大量标注数据或频繁重新训练的鲁棒垃圾邮件检测,使其对未见过的垃圾邮件模式和对抗性环境具有高度适应性。本研究凸显了利用零样本学习和NLP技术实现可扩展且高效垃圾邮件检测的潜力,为应对垃圾邮件检测任务的动态性和挑战性提供了新的见解。


RM-R1: Reward Modeling as Reasoning

Abstract

arXiv:2505.02387v1 Announce Type: cross Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques, thus lacking interpretability. Inspired by recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM's interpretability and performance. In this work, we introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs) -- which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance of generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.

摘要

奖励建模对于使大语言模型(LLM)与人类偏好对齐至关重要,尤其是通过基于人类反馈的强化学习(RLHF)。为了提供准确的奖励信号,奖励模型(RM)应在给出评分或判断前激发深度思考并进行可解释的推理。然而,现有RM要么生成不透明的标量分数,要么直接预测优选答案,导致其难以整合自然语言批评,因而缺乏可解释性。受长思维链(CoT)在推理密集型任务中的最新进展启发,我们提出假设并验证:将推理能力整合到奖励建模中可显著提升RM的可解释性和性能。本文提出了一类新的生成式奖励模型——推理奖励模型(ReasRM),将奖励建模构建为推理任务。我们设计了面向推理的训练流程,并训练了ReasRM系列模型RM-R1。训练包含两个关键阶段:(1)高质量推理链的蒸馏;(2)基于可验证奖励的强化学习。RM-R1通过自生成推理轨迹或对话专用评分标准,并据此评估候选响应,从而改进LLM输出。实验表明,我们的模型在多个综合奖励模型基准测试中达到或接近生成式RM的最先进性能,最高可超越大型开源模型(如Llama3.1-405B)和专有模型(如GPT-4o)达13.8%。除最终性能外,我们还进行了全面实证分析以理解成功训练ReasRM的关键要素。为促进未来研究,我们在https://github.com/RM-R1-UIUC/RM-R1发布了六个ReasRM模型及相关代码与数据。


Quantitative Analysis of Performance Drop in DeepSeek Model Quantization

Abstract

arXiv:2505.02390v1 Announce Type: cross Abstract: Recently, there has been high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service is often busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used technique that helps reduce model memory consumption. However, it is unclear how DeepSeek-R1 and V3 will perform after being quantized. This technical report presents the first quantitative evaluation of multi-bitwidth quantization across the complete DeepSeek model spectrum. Key findings reveal that 4-bit quantization incurs little performance degradation versus FP8 while enabling single-machine deployment on standard NVIDIA GPU devices. We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms the traditional Q3_K_M variant on various benchmarks and is comparable with the 4-bit quantization (Q4_K_M) approach in most tasks. Moreover, DQ3_K_M supports single-machine deployment configurations for both NVIDIA H100/A100 and Huawei 910B. Our implementation of DQ3_K_M is released at https://github.com/UnicomAI/DeepSeek-Eval, containing optimized 3-bit quantized variants of both DeepSeek-R1 and DeepSeek-V3.

摘要

近期,本地化部署DeepSeek-R1和V3的需求激增,可能源于官方服务常处于繁忙状态且部分机构存在数据隐私顾虑。虽然单机部署具有基础设施简单的优势,但模型671B FP8参数配置超出了标准8-GPU机器的实际内存限制。量化作为一种广泛应用的技术,可有效降低模型内存占用。然而,目前尚不清楚DeepSeek-R1和V3量化后的性能表现。本技术报告首次对DeepSeek全系列模型进行了多比特宽度量化的系统性评估。关键发现表明:4比特量化在保持与FP8相近性能的同时,可实现标准NVIDIA GPU设备的单机部署。我们进一步提出DQ3_K_M动态3比特量化方法,其在多项基准测试中显著优于传统Q3_K_M变体,且在大多数任务中与4比特量化(Q4_K_M)方法性能相当。此外,DQ3_K_M同时支持NVIDIA H100/A100和华为910B的单机部署配置。DQ3_K_M的实现已发布于https://github.com/UnicomAI/DeepSeek-Eval,包含DeepSeek-R1和DeepSeek-V3的优化3比特量化变体。
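
To make the bit-width trade-off concrete, here is a sketch of plain symmetric round-to-nearest quantization of one block of weights. It is deliberately not the Q4_K_M, Q3_K_M, or DQ3_K_M scheme evaluated in the report, whose block layouts are more elaborate; the function names and the single per-block scale are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def quantize_block(weights: Sequence[float], bits: int = 4) -> Tuple[List[int], float]:
    """Quantize one block to signed `bits`-bit integer codes with a shared scale."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits, 3 for 3 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits, -4 for 3 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # an all-zero block keeps scale 1
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes: Sequence[int], scale: float) -> List[float]:
    """Reconstruct approximate weights from the integer codes and the block scale."""
    return [c * scale for c in codes]
```

Each reconstructed weight is off by at most half a scale step, and the step size roughly doubles when going from 4 bits to 3, which is why lower bit widths need more careful schemes to limit the accuracy loss.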


Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL

Abstract

arXiv:2505.02391v1 Announce Type: cross Abstract: Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty and convergence behavior. This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation due to static sampling strategies. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy designed to minimize stochastic gradient variance under a computational budget constraint. The method dynamically allocates computational resources by monitoring prompt acceptance rates and stochastic gradient norms, ensuring that the resulting gradient variance is minimized. Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions. Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy. Our code is available at https://github.com/RLHFlow/GVM.

摘要

大语言模型(LLMs)中的思维链(CoT)推理可形式化为一个潜在变量问题,即模型需要生成中间推理步骤。尽管迭代奖励排序微调(RAFT)等现有方法依赖此类形式化框架,但它们通常对所有提示采用统一的推理预算,未能考虑问题难度与收敛行为的差异性。本研究指出CoT训练的主要瓶颈在于静态采样策略导致的随机梯度估计效率低下。我们提出GVM-RAFT方法——一种针对特定提示的动态样本分配策略,旨在计算预算约束下最小化随机梯度方差。该方法通过监测提示接受率和随机梯度范数动态分配计算资源,确保所得梯度方差最小化。理论分析表明,所提出的动态采样策略在适当条件下可实现加速收敛保证。数学推理实验显示,GVM-RAFT相比原始RAFT实现了2-4倍加速,并带来显著准确率提升。该动态采样策略具有通用性,可整合至GRPO等其他强化学习算法,同样能改善收敛性和测试准确率。代码已开源:https://github.com/RLHFlow/GVM。
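
The budget-allocation idea can be sketched as follows. This is a minimal illustration, not the GVM-RAFT rule itself. It assumes the variance contributed by prompt i falls as s_i^2 / n_i when n_i samples are drawn, in which case minimizing the total under a fixed budget gives n_i proportional to s_i. The score s_i (gradient norm divided by the square root of the acceptance rate), the function name `allocate_samples`, and the `min_per_prompt` floor are assumptions made for this example.

```python
import math

def allocate_samples(grad_norms, accept_rates, budget, min_per_prompt=1):
    """Split a fixed sampling budget across prompts in proportion to a score."""
    n = len(grad_norms)
    spare = budget - min_per_prompt * n
    if spare < 0:
        raise ValueError("budget is too small for the per-prompt minimum")
    # Hypothetical score: a larger gradient norm or a lower acceptance rate
    # earns the prompt more samples. The clamp avoids division by zero.
    scores = [g / math.sqrt(max(p, 1e-6)) for g, p in zip(grad_norms, accept_rates)]
    total = sum(scores)
    if total == 0:
        scores, total = [1.0] * n, float(n)
    raw = [spare * s / total for s in scores]
    alloc = [min_per_prompt + int(r) for r in raw]
    # Hand out the samples lost to rounding, largest fractional part first.
    leftover = budget - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc

# Three prompts: easy, rarely accepted, and high gradient norm.
alloc = allocate_samples([1.0, 1.0, 4.0], [0.9, 0.1, 0.5], budget=64)
```

A uniform strategy would give each prompt about 21 samples here. The sketch instead shifts samples toward the prompts whose estimates are assumed to be noisier.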


Bielik 11B v2 Technical Report

Abstract

arXiv:2505.02410v1 Announce Type: cross Abstract: We present Bielik 11B v2, a state-of-the-art language model optimized for Polish text processing. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling, this model demonstrates exceptional performance across Polish language benchmarks while maintaining strong cross-lingual capabilities. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss, which optimizes learning across diverse instruction types by assigning quality-based weights to training examples, and Adaptive Learning Rate, which dynamically adjusts based on context length. Comprehensive evaluation across multiple benchmarks demonstrates that Bielik 11B v2 outperforms many larger models, including those with 2-6 times more parameters, and significantly surpasses other specialized Polish language models on tasks ranging from linguistic understanding to complex reasoning. The model's parameter efficiency and extensive quantization options enable deployment across various hardware configurations, advancing Polish language AI capabilities and establishing new benchmarks for resource-efficient language modeling in less-represented languages.

摘要

我们推出Bielik 11B v2——专为波兰语文本处理优化的尖端语言模型。该模型基于Mistral 7B v0.2架构,通过深度扩展技术将参数量提升至110亿,在保持强大跨语言能力的同时,于波兰语基准测试中展现出卓越性能。我们引入两项关键技术创新:基于质量权重分配训练样本的加权指令交叉熵损失函数,可优化跨指令类型的学习效果;以及根据上下文长度动态调整的自适应学习率。多基准测试的综合评估表明,Bielik 11B v2在从语言理解到复杂推理的各项任务中,不仅超越了许多参数量为其2-6倍的更大模型,更显著优于其他波兰语专用模型。该模型凭借参数高效性和广泛的量化选项,可适配多种硬件配置部署,既推动了波兰语人工智能的发展,也为资源受限语言的高效建模确立了新基准。
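
A quality-weighted cross-entropy of the kind described above might look like the sketch below. The abstract does not give the exact formula, so the normalization (dividing by the summed weights over all target tokens) and the name `weighted_instruction_ce` are assumptions for illustration only.

```python
import math

def weighted_instruction_ce(token_probs, weights):
    """Quality-weighted mean negative log-likelihood.

    token_probs: one list per training example, holding the probability the
                 model assigned to each target token.
    weights:     one quality weight per example (assumed to be non-negative).
    """
    weighted_nll = 0.0
    weight_mass = 0.0
    for probs, w in zip(token_probs, weights):
        for p in probs:
            weighted_nll += -w * math.log(p)
            weight_mass += w
    if weight_mass == 0:
        raise ValueError("at least one example needs a positive weight")
    return weighted_nll / weight_mass

# A zero weight removes the second example from the loss entirely.
loss = weighted_instruction_ce([[0.5, 0.5], [0.25]], [1.0, 0.0])
```

With all weights equal, this reduces to the ordinary token-level mean cross-entropy.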


Automated Hybrid Reward Scheduling via Large Language Models for Robotic Skill Learning

Abstract

arXiv:2505.02483v1 Announce Type: cross Abstract: Enabling a high-degree-of-freedom robot to learn specific skills is a challenging task due to the complexity of robotic dynamics. Reinforcement learning (RL) has emerged as a promising solution; however, addressing such problems requires the design of multiple reward functions to account for various constraints in robotic motion. Existing approaches typically sum all reward components indiscriminately to optimize the RL value function and policy. We argue that this uniform inclusion of all reward components in policy optimization is inefficient and limits the robot's learning performance. To address this, we propose an Automated Hybrid Reward Scheduling (AHRS) framework based on Large Language Models (LLMs). This paradigm dynamically adjusts the learning intensity of each reward component throughout the policy optimization process, enabling robots to acquire skills in a gradual and structured manner. Specifically, we design a multi-branch value network, where each branch corresponds to a distinct reward component. During policy optimization, each branch is assigned a weight that reflects its importance, and these weights are automatically computed based on rules designed by LLMs. The LLM generates a rule set in advance, derived from the task description, and during training, it selects a weight calculation rule from the library based on language prompts that evaluate the performance of each branch. Experimental results demonstrate that the AHRS method achieves an average 6.48% performance improvement across multiple high-degree-of-freedom robotic tasks.

摘要

由于机器人动力学的高度复杂性,让高自由度机器人学习特定技能是一项极具挑战性的任务。强化学习(RL)已成为一种有前景的解决方案,但处理此类问题需要设计多个奖励函数以兼顾机器人运动中的各种约束。现有方法通常不加区分地将所有奖励分量相加来优化强化学习的价值函数和策略。我们认为,在策略优化中统一纳入所有奖励分量的做法效率低下,且限制了机器人的学习性能。为此,我们提出了一种基于大语言模型(LLMs)的自动混合奖励调度(AHRS)框架。该范式能在策略优化过程中动态调整各奖励分量的学习强度,使机器人能够以渐进、结构化的方式掌握技能。具体而言,我们设计了一个多分支价值网络,每个分支对应不同的奖励分量。在策略优化时,每个分支会根据其重要性被赋予相应权重,这些权重由LLMs设计的规则自动计算得出。LLM会预先根据任务描述生成规则集,并在训练过程中根据评估各分支性能的语言提示从规则库中选择权重计算规则。实验结果表明,AHRS方法在多个高自由度机器人任务中平均实现了6.48%的性能提升。


SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning

Abstract

arXiv:2505.02486v1 Announce Type: cross Abstract: Multimodal Continual Instruction Tuning (MCIT) aims to enable Multimodal Large Language Models (MLLMs) to incrementally learn new tasks without catastrophic forgetting. In this paper, we explore forgetting in this context, categorizing it into superficial forgetting and essential forgetting. Superficial forgetting refers to cases where the model's knowledge may not be genuinely lost, but its responses to previous tasks deviate from expected formats due to the influence of subsequent tasks' answer styles, making the results unusable. By contrast, essential forgetting refers to situations where the model provides correctly formatted but factually inaccurate answers, indicating a true loss of knowledge. Assessing essential forgetting necessitates addressing superficial forgetting first, as severe superficial forgetting can obscure the model's knowledge state. Hence, we first introduce the Answer Style Diversification (ASD) paradigm, which defines a standardized process for transforming data styles across different tasks, unifying their training sets into similarly diversified styles to prevent superficial forgetting caused by style shifts. Building on this, we propose RegLoRA to mitigate essential forgetting. RegLoRA stabilizes key parameters where prior knowledge is primarily stored by applying regularization, enabling the model to retain existing competencies. Experimental results demonstrate that our overall method, SEFE, achieves state-of-the-art performance.

摘要

多模态持续指令微调(MCIT)旨在使多模态大语言模型(MLLMs)能够增量学习新任务而不发生灾难性遗忘。本文针对该场景下的遗忘现象进行探究,将其划分为表层遗忘与本质遗忘:表层遗忘指模型知识可能并未真正丢失,但由于后续任务答案风格的干扰,导致其对先前任务的响应偏离预期格式,致使结果无法使用;本质遗忘则指模型输出格式正确但事实错误的答案,表明知识确实丧失。评估本质遗忘需先解决表层遗忘,因严重的表层遗忘会掩盖模型真实知识状态。为此,我们首先提出答案风格多样化(ASD)范式,通过定义跨任务数据风格转换的标准化流程,将各任务训练集统一为相似多样化风格,以预防风格迁移导致的表层遗忘。在此基础上,我们提出RegLoRA来缓解本质遗忘——该方法通过正则化稳定存储先验知识的关键参数,使模型保持现有能力。实验结果表明,我们的整体方法SEFE取得了最先进的性能表现。


Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study

Abstract

arXiv:2505.02502v1 Announce Type: cross Abstract: Background: Large language models (LLMs) are increasingly deployed via open-source and commercial frameworks, enabling individuals and organizations to self-host advanced AI capabilities. However, insecure defaults and misconfigurations often expose LLM services to the public Internet, posing significant security and system engineering risks. Aims: This study aims to unveil the current landscape of public-facing LLM deployments in the wild through a large-scale empirical study, focusing on service prevalence, exposure characteristics, systemic vulnerabilities, and associated risks. Method: We conducted an Internet-wide measurement to identify public-facing LLM deployments across 15 frameworks, discovering 320,102 services. We extracted 158 unique API endpoints, grouped into 12 functional categories based on capabilities and security risks. We further analyzed configurations, authentication practices, and geographic distributions, revealing deployment trends and systemic issues in real-world LLM system engineering. Results: Our study shows that public LLM deployments are rapidly growing but often insecure. Among all endpoints, we observe widespread use of insecure protocols, poor TLS configurations, and unauthenticated access to critical operations. Security risks, including model disclosure, system leakage, and unauthorized access, are pervasive, highlighting the need for secure-by-default frameworks and stronger deployment practices. Conclusions: Public-facing LLM deployments suffer from widespread security and configuration flaws, exposing services to misuse, model theft, resource hijacking, and remote exploitation. Strengthening default security, deployment practices, and operational standards is critical for the growing self-hosted LLM ecosystem.

摘要

背景:大型语言模型(LLMs)正越来越多地通过开源和商业框架部署,使个人和组织能够自主托管先进AI能力。然而,不安全的默认设置和错误配置常使LLM服务暴露于公共互联网,带来重大安全与系统工程风险。目标:本研究旨在通过大规模实证研究揭示当前公共LLM部署现状,重点关注服务普及度、暴露特征、系统性漏洞及相关风险。方法:我们实施了全网测量,识别出15个框架下的320,102个公共LLM服务。提取158个独特API端点,根据功能与安全风险划分为12个类别。通过分析配置策略、认证实践和地理分布,揭示了实际LLM系统工程中的部署趋势与系统性问题。结果:研究表明公共LLM部署快速增长但普遍存在安全隐患。所有端点中普遍存在不安全协议使用、TLS配置缺陷及关键操作未授权访问等问题。模型泄露、系统信息泄漏和未授权访问等安全风险广泛存在,凸显了默认安全框架和强化部署实践的必要性。结论:面向公众的LLM部署存在普遍的安全与配置缺陷,导致服务滥用、模型窃取、资源劫持和远程攻击风险。强化默认安全性、部署实践和操作标准对日益增长的自托管LLM生态系统至关重要。


Bielik v3 Small: Technical Report

Abstract

arXiv:2505.02550v1 Announce Type: cross Abstract: We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.

摘要

我们推出Bielik v3系列参数高效生成文本模型(15亿和45亿参数),专为波兰语处理优化。研究表明,经过精心优化的较小架构在显著减少计算资源需求的同时,仍能达到与更大规模模型相当的性能。该研究包含多项关键创新:定制波兰语分词器(APT4)显著提升分词效率,加权指令交叉熵损失函数平衡不同类型指令的学习,以及基于训练进度动态调整的自适应学习率。这些模型在精心筛选的2920亿标记、覆盖3.03亿文档的语料库上进行训练,在多项基准测试中表现卓越,包括Open PL大语言模型排行榜、复杂波兰语文本理解基准、波兰EQ-Bench及波兰医学排行榜。其中45亿参数模型的性能可媲美其2-3倍规模的模型,而15亿参数模型在极度紧凑的结构下仍展现出强劲性能。这些进展为资源受限应用中实现高质量波兰语AI建立了新基准,为低资源语言的高效参数建模树立了新标准。


EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning

Abstract

arXiv:2505.02579v1 Announce Type: cross Abstract: Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including complex objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the training to improve efficiency and flexibility. Our method is the first to aggregate the last hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text-scoring LLMs to evaluate the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption (17,529 ± 1,650 data points and 6,573 ± 147.43 seconds), improved scalability and explainability, and comparable performance across multiple objectives.

摘要

近年来,基于强化学习(RL)的大语言模型(LLM)微调技术在多目标任务处理方面展现出潜力,但仍面临目标平衡复杂、训练效率低下、可扩展性差和可解释性有限等挑战。借鉴集成学习原理,我们提出一种集成多目标强化学习(EMORL)框架:通过为各目标分别微调独立模型,并在训练后优化模型聚合策略以提升效率与灵活性。该方法首创性地聚合各模型的最后隐藏状态,融合多目标的上下文信息,并通过分层网格搜索算法确定最优权重组合。在心理咨询师反思生成任务中,我们采用文本评分LLM对生成内容进行评估并提供RL微调奖励。基于PAIR和Psych8k数据集的实验表明,EMORL相较基线方法具有显著优势:训练消耗显著降低且更稳定(17,529±1,650个数据点,6,573±147.43秒)、可扩展性与可解释性提升,同时在多目标上保持相当性能。
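
The two ingredients named above, a weighted combination of last hidden states and a hierarchical grid search over the weights, can be sketched as below. The abstract does not specify either procedure, so this is an assumed coarse-to-fine variant: each level evaluates a small grid, then shrinks the search box around the best point. The names `aggregate` and `hierarchical_grid_search`, and the toy `score` function, are hypothetical.

```python
import itertools

def aggregate(hidden_states, weights):
    """Weighted sum of the per-model last hidden state vectors."""
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]

def hierarchical_grid_search(score, num_models, levels=3, points=3):
    """Coarse-to-fine search for weights in [0, 1]^num_models that maximize score."""
    lo, hi = [0.0] * num_models, [1.0] * num_models
    best_w, best_s = None, float("-inf")
    for _ in range(levels):
        axes = [[l + (h - l) * k / (points - 1) for k in range(points)]
                for l, h in zip(lo, hi)]
        for w in itertools.product(*axes):
            s = score(w)
            if s > best_s:
                best_w, best_s = w, s
        # Shrink the box to half its width, centred on the best point so far.
        radius = [(h - l) / 4 for l, h in zip(lo, hi)]
        lo = [max(0.0, b - r) for b, r in zip(best_w, radius)]
        hi = [min(1.0, b + r) for b, r in zip(best_w, radius)]
    return best_w, best_s

# Toy objective standing in for an evaluation of the aggregated model.
def score(w):
    return -((w[0] - 0.6) ** 2 + (w[1] - 0.2) ** 2)

mixed = aggregate([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])
best_w, best_s = hierarchical_grid_search(score, num_models=2)
```

Three levels of a 3-point grid cost 27 evaluations for two models, against 81 for a single 9-point grid of similar resolution.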


LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

Abstract

arXiv:2505.02625v1 Announce Type: cross Abstract: Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.

摘要

实时、智能且自然的语音交互是下一代人机交互的核心组成部分。近期研究进展表明,基于大语言模型(LLMs)构建智能语音聊天机器人具有巨大潜力。本文提出LLaMA-Omni 2系列语音语言模型(SpeechLMs),其参数量级覆盖0.5B至14B,能够实现高质量的实时语音交互。该模型基于Qwen2.5系列架构,整合了语音编码器与自回归流式语音解码器。尽管仅使用20万条多轮语音对话样本进行训练,LLaMA-Omni 2在多项语音问答和语音指令跟随基准测试中表现出色,其性能超越此前基于数百万小时语音数据训练的先进语音语言模型(如GLM-4-Voice)。


Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning

Abstract

arXiv:2505.02639v1 Announce Type: cross Abstract: Chemical reaction and retrosynthesis prediction are fundamental tasks in drug discovery. Recently, large language models (LLMs) have shown potential in many domains. However, directly applying LLMs to these tasks faces two major challenges: (i) lacking a large-scale chemical synthesis-related instruction dataset; (ii) ignoring the close correlation between reaction and retrosynthesis prediction for the existing fine-tuning strategies. To address these challenges, we propose ChemDual, a novel LLM framework for accurate chemical synthesis. Specifically, considering the high cost of data acquisition for reaction and retrosynthesis, ChemDual regards the reaction-and-retrosynthesis of molecules as a related recombination-and-fragmentation process and constructs a large-scale dataset of 4.4 million instructions. Furthermore, ChemDual introduces an enhanced LLaMA, equipped with a multi-scale tokenizer and dual-task learning strategy, to jointly optimize the process of recombination and fragmentation as well as the tasks between reaction and retrosynthesis prediction. Extensive experiments on Mol-Instruction and USPTO-50K datasets demonstrate that ChemDual achieves state-of-the-art performance in both predictions of reaction and retrosynthesis, outperforming the existing conventional single-task approaches and the general open-source LLMs. Through molecular docking analysis, ChemDual generates compounds with diverse and strong protein binding affinity, further highlighting its strong potential in drug design.

摘要

化学反应与逆合成预测是药物发现中的基础任务。近年来,大型语言模型(LLMs)在多个领域展现出潜力。然而,直接将其应用于这些任务面临两大挑战:(i)缺乏大规模化学合成相关指令数据集;(ii)现有微调策略忽略了反应预测与逆合成预测间的紧密关联。为解决这些问题,我们提出ChemDual——一个用于精确化学合成的新型LLM框架。具体而言,针对反应与逆合成数据获取成本高的问题,ChemDual将分子反应-逆合成视为相关的重组-碎片化过程,构建了包含440万条指令的大规模数据集。此外,ChemDual采用增强版LLaMA模型,配备多尺度分词器与双任务学习策略,联合优化重组-碎片化过程及反应-逆合成预测任务。在Mol-Instruction和USPTO-50K数据集上的大量实验表明,ChemDual在反应与逆合成预测中均达到最先进性能,优于现有传统单任务方法和通用开源LLMs。通过分子对接分析,ChemDual生成的化合物具有多样且强大的蛋白质结合亲和力,进一步凸显其在药物设计中的强大潜力。


A Note on Statistically Accurate Tabular Data Generation Using Large Language Models

Abstract

arXiv:2505.02659v1 Announce Type: cross Abstract: Large language models (LLMs) have shown promise in synthetic tabular data generation, yet existing methods struggle to preserve complex feature dependencies, particularly among categorical variables. This work introduces a probability-driven prompting approach that leverages LLMs to estimate conditional distributions, enabling more accurate and scalable data synthesis. The results highlight the potential of prompting probability distributions to enhance the statistical fidelity of LLM-generated tabular data.

摘要

大语言模型(LLMs)在合成表格数据生成方面展现出潜力,但现有方法难以保持复杂特征依赖关系,尤其是分类变量之间的关联。本研究提出一种概率驱动的提示方法,利用LLMs估计条件概率分布,从而实现更精确、可扩展的数据合成。结果表明,通过提示概率分布能有效提升LLM生成表格数据的统计保真度。
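
One way to read the probability-driven approach is that the LLM is asked for a conditional distribution, and rows are then sampled from it directly instead of being generated as text. The sketch below shows only the sampling half, under that assumption. The tables `marginal` and `conditional` are hard-coded, hypothetical stand-ins for distributions an LLM would return, and `sample_row` is an illustrative name.

```python
import random

# Hypothetical distributions standing in for LLM estimates.
marginal = {"urban": 0.5, "rural": 0.5}
conditional = {
    "urban": {"car": 0.2, "transit": 0.8},
    "rural": {"car": 1.0, "transit": 0.0},  # a combination that must never occur
}

def sample_row(marginal, conditional, rng):
    """Draw the first feature from its marginal, then the second given the first."""
    first = rng.choices(list(marginal), weights=list(marginal.values()))[0]
    dist = conditional[first]
    second = rng.choices(list(dist), weights=list(dist.values()))[0]
    return first, second

rng = random.Random(0)
rows = [sample_row(marginal, conditional, rng) for _ in range(5000)]
```

Because each value is drawn from an explicit distribution, dependencies between categorical features, including combinations with probability zero, are preserved by construction.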


AI Standardized Patient Improves Human Conversations in Advanced Cancer Care

Abstract

arXiv:2505.02694v1 Announce Type: cross Abstract: Serious illness communication (SIC) in end-of-life care faces challenges such as emotional stress, cultural barriers, and balancing hope with honesty. Despite its importance, one of the few available ways for clinicians to practice SIC is with standardized patients, which is expensive, time-consuming, and inflexible. In this paper, we present SOPHIE, an AI-powered standardized patient simulation and automated feedback system. SOPHIE combines large language models (LLMs), a lifelike virtual avatar, and automated, personalized feedback based on clinical literature to provide remote, on-demand SIC training. In a randomized control study with healthcare students and professionals, SOPHIE users demonstrated significant improvement across three critical SIC domains: Empathize, Be Explicit, and Empower. These results suggest that AI-driven tools can enhance complex interpersonal communication skills, offering scalable, accessible solutions to address a critical gap in clinician education.

摘要

临终关怀中的重病沟通(SIC)面临情感压力、文化障碍及希望与诚实间平衡等挑战。尽管其重要性显著,临床医师目前仅能通过标准化病人进行有限实践,这种方式成本高昂、耗时且缺乏灵活性。本文提出SOPHIE系统——一种基于人工智能的标准化病人模拟与自动化反馈系统。该系统整合大型语言模型(LLMs)、逼真虚拟形象及基于临床文献的自动化个性化反馈,可提供远程按需SIC培训。针对医疗学员与专业人员的随机对照研究表明,SOPHIE使用者在"共情"、"明确表达"和"赋权"三个核心SIC领域均取得显著提升。这些结果表明,人工智能驱动工具能够增强复杂人际沟通技能,为临床医师教育中的关键缺口提供可扩展、易获取的解决方案。


Knowledge Graphs for Enhancing Large Language Models in Entity Disambiguation

Abstract

arXiv:2505.02737v1 Announce Type: cross Abstract: Recent advances in Large Language Models (LLMs) have positioned them as a prominent solution for Natural Language Processing tasks. Notably, they can approach these problems in a zero or few-shot manner, thereby eliminating the need for training or fine-tuning task-specific models. However, LLMs face some challenges, including hallucination and the presence of outdated knowledge or missing information from specific domains in the training data. These problems cannot be easily solved by retraining the models with new data as it is a time-consuming and expensive process. To mitigate these issues, Knowledge Graphs (KGs) have been proposed as a structured external source of information to enrich LLMs. With this idea, in this work we use KGs to enhance LLMs for zero-shot Entity Disambiguation (ED). For that purpose, we leverage the hierarchical representation of the entities' classes in a KG to gradually prune the candidate space as well as the entities' descriptions to enrich the input prompt with additional factual knowledge. Our evaluation on popular ED datasets shows that the proposed method outperforms non-enhanced and description-only enhanced LLMs, and has a higher degree of adaptability than task-specific models. Furthermore, we conduct an error analysis and discuss the impact of the leveraged KG's semantic expressivity on the ED performance.

摘要

大语言模型(LLMs)的最新进展使其成为自然语言处理任务的重要解决方案。值得注意的是,它们能以零样本或少样本方式处理这些问题,从而无需训练或微调特定任务模型。然而,LLMs面临一些挑战,包括幻觉问题以及训练数据中存在过时知识或特定领域信息缺失。这些问题难以通过重新训练模型来解决,因为该过程耗时且成本高昂。为缓解这些缺陷,知识图谱(KGs)被提出作为结构化外部信息源来增强LLMs。基于此思路,本研究利用KGs提升LLMs在零样本实体消歧(ED)中的表现。我们通过KG中实体类别的层次化表示逐步剪枝候选空间,并利用实体描述丰富输入提示的附加事实知识。在主流ED数据集上的评估表明,该方法优于未增强及仅使用描述增强的LLMs,且比特定任务模型具有更高适应性。此外,我们进行了错误分析,并探讨了所采用KG的语义表达能力对ED性能的影响。
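
The gradual pruning of the candidate space can be sketched as a walk down the class hierarchy: at each depth, the classes of the surviving candidates are collected, one class is picked, and candidates outside it are dropped. This is an assumed reading of the abstract, not the authors' procedure. The entity identifiers, class paths, and the `choose_class` callback (which stands in for an LLM prompt) are hypothetical.

```python
def prune_by_hierarchy(candidates, choose_class):
    """Narrow candidates level by level using their class paths.

    candidates:   {entity_id: [class at depth 0, class at depth 1, ...]}
    choose_class: callable that receives the classes present at one depth
                  and returns the one that best fits the mention.
    """
    remaining = dict(candidates)
    depth = 0
    while len(remaining) > 1:
        classes = sorted({p[depth] for p in remaining.values() if depth < len(p)})
        if not classes:
            break  # hierarchy exhausted; leave the rest to the description step
        if len(classes) > 1:
            picked = choose_class(classes)
            remaining = {e: p for e, p in remaining.items()
                         if depth < len(p) and p[depth] == picked}
        depth += 1
    return sorted(remaining)

candidates = {
    "paris_france": ["Place", "City", "Capital"],
    "paris_texas": ["Place", "City", "Town"],
    "paris_hilton": ["Agent", "Person"],
}
# Stand-in for the LLM: prefers classes that suit "the capital on the Seine".
preferred = {"Place", "Capital"}
result = prune_by_hierarchy(candidates, lambda cs: next(c for c in cs if c in preferred))
```

Candidates that survive every level would then be passed, with their descriptions, into the final disambiguation prompt.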


Abstract

arXiv:2505.02763v1 Announce Type: cross Abstract: Legal practice requires careful adherence to procedural rules. In the United States, few are more complex than those found in The Bluebook: A Uniform System of Citation. Compliance with this system's 500+ pages of byzantine formatting instructions is the raison d'etre of thousands of student law review editors and the bete noire of lawyers everywhere. To evaluate whether large language models (LLMs) are able to adhere to the procedures of such a complicated system, we construct an original dataset of 866 Bluebook tasks and test flagship LLMs from OpenAI, Anthropic, Google, Meta, and DeepSeek. We show (1) that these models produce fully compliant Bluebook citations only 69%-74% of the time and (2) that in-context learning on the Bluebook's underlying system of rules raises accuracy only to 77%. These results caution against using off-the-shelf LLMs to automate aspects of the law where fidelity to procedure is paramount.

摘要

法律实践需要严格遵守程序规则。在美国,最复杂的程序规则莫过于《蓝皮书:统一引注体系》。遵循这本500多页错综复杂的格式指南,既是数千名法律评论学生编辑存在的理由,也是各地律师的噩梦。为评估大语言模型(LLMs)能否遵守如此复杂体系的程序规则,我们构建了包含866项蓝皮书任务的原创数据集,并测试了来自OpenAI、Anthropic、Google、Meta和深度求索公司的旗舰模型。研究表明:(1)这些模型生成的蓝皮书引文完全合规率仅为69%-74%;(2)通过对蓝皮书底层规则体系进行上下文学习,准确率仅提升至77%。这些结果表明,在程序保真度至关重要的法律领域自动化应用中,需谨慎使用现成的大语言模型。


HSplitLoRA: A Heterogeneous Split Parameter-Efficient Fine-Tuning Framework for Large Language Models

Abstract

arXiv:2505.02795v1 Announce Type: cross Abstract: Recently, large language models (LLMs) have achieved remarkable breakthroughs, revolutionizing the natural language processing domain and beyond. Due to immense parameter sizes, fine-tuning these models with private data for diverse downstream tasks has become mainstream. Though federated learning (FL) offers a promising solution for fine-tuning LLMs without sharing raw data, substantial computing costs hinder its democratization. Moreover, in real-world scenarios, private client devices often possess heterogeneous computing resources, further complicating LLM fine-tuning. To combat these challenges, we propose HSplitLoRA, a heterogeneous parameter-efficient fine-tuning (PEFT) framework built on split learning (SL) and low-rank adaptation (LoRA) fine-tuning, for efficiently fine-tuning LLMs on heterogeneous client devices. HSplitLoRA first identifies important weights based on their contributions to LLM training. It then dynamically configures the decomposition ranks of LoRA adapters for selected weights and determines the model split point according to varying computing budgets of client devices. Finally, a noise-free adapter aggregation mechanism is devised to support heterogeneous adapter aggregation without introducing noise. Extensive experiments demonstrate that HSplitLoRA outperforms state-of-the-art benchmarks in training accuracy and convergence speed.

摘要

近年来,大型语言模型(LLMs)取得了显著突破,彻底改变了自然语言处理领域及其他相关领域。由于参数量庞大,利用私有数据对这些模型进行微调以适配多样化下游任务已成为主流方法。尽管联邦学习(FL)为在不共享原始数据的情况下微调LLMs提供了可行方案,但高昂的计算成本阻碍了其普及化应用。此外,现实场景中私有客户端设备通常具有异构计算资源,这进一步增加了LLM微调的复杂性。为应对这些挑战,我们提出HSplitLoRA——一种基于分割学习(SL)和低秩自适应(LoRA)微调的异构高效参数微调(PEFT)框架,可在异构客户端设备上高效微调LLMs。该框架首先根据权重对LLM训练的贡献度识别重要权重,随后针对选定权重动态配置LoRA适配器的分解秩,并根据客户端设备的不同计算预算确定模型分割点。最后设计了一种无噪声适配器聚合机制,可在不引入噪声的情况下支持异构适配器聚合。大量实验表明,HSplitLoRA在训练精度和收敛速度方面均优于当前最先进的基准方法。


GenAINet: Enabling Wireless Collective Intelligence via Knowledge Transfer and Reasoning

Abstract

arXiv:2402.16631v3 Announce Type: replace Abstract: Generative Artificial Intelligence (GenAI) and communication networks are expected to have groundbreaking synergies for 6G. Connecting GenAI agents via a wireless network can potentially unleash the power of Collective Intelligence (CI) and pave the way for Artificial General Intelligence (AGI). However, current wireless networks are designed as a "data pipe" and are not suited to accommodate and leverage the power of GenAI. In this paper, we propose the GenAINet framework in which distributed GenAI agents communicate knowledge (facts, experiences, and methods) to accomplish arbitrary tasks. We first propose an architecture for a single GenAI agent and then provide a network architecture integrating GenAI capabilities to manage both network protocols and applications. Building on this, we investigate effective communication and reasoning problems by proposing a semantic-native GenAINet. Specifically, GenAI agents extract semantics from heterogeneous raw data, build and maintain a knowledge model representing the semantic relationships among pieces of knowledge, which is retrieved by GenAI models for planning and reasoning. Under this paradigm, different levels of collaboration can be achieved flexibly depending on the complexity of targeted tasks. Furthermore, we conduct two case studies in which, through wireless device queries, we demonstrate that extracting, compressing and transferring common knowledge can improve query accuracy while reducing communication costs; and in the wireless power control problem, we show that distributed agents can complete general tasks independently through collaborative reasoning without predefined communication protocols. Finally, we discuss challenges and future research directions in applying Large Language Models (LLMs) in 6G networks.

摘要

生成式人工智能(GenAI)与通信网络的融合预计将为6G带来突破性协同效应。通过无线网络连接GenAI智能体,有望释放集体智能(CI)的潜力,并为通用人工智能(AGI)的发展铺平道路。然而,现有无线网络被设计为"数据管道",难以适配并发挥GenAI的效能。本文提出GenAINet框架,通过分布式GenAI智能体间的知识(事实、经验与方法)交互来完成任意任务。我们首先构建单体GenAI智能体架构,继而提出整合GenAI能力的网络架构以同时管理网络协议与应用。在此基础上,通过构建语义原生的GenAINet,我们研究了高效通信与推理问题:GenAI智能体从异构原始数据中提取语义,建立并维护表征知识间语义关系的知识模型,供GenAI模型检索以进行规划推理。该范式可根据目标任务的复杂度灵活实现不同层级的协作。通过两个案例研究验证:在无线设备查询中,提取、压缩和传输公共知识可提升查询准确率并降低通信开销;在无线功率控制问题中,分布式智能体无需预定义通信协议即可通过协作推理独立完成通用任务。最后,我们探讨了大规模语言模型(LLM)在6G网络中应用面临的挑战与未来研究方向。


Balancing Pipeline Parallelism with Vocabulary Parallelism

Abstract

arXiv:2411.05288v2 Announce Type: replace Abstract: Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have sought to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5% to 51% improvement in throughput compared to naive approaches, meanwhile significantly reducing peak memory usage especially for large vocabulary scenarios. Our implementation is open-sourced at https://github.com/sail-sg/VocabularyParallelism.

摘要

流水线并行技术被广泛用于扩展基于Transformer的大语言模型训练,已有诸多研究致力于提升其吞吐量和内存效率。本文针对一个常被忽视的问题展开研究:词汇层会导致流水线各阶段的计算与内存使用不均衡,加剧流水线气泡和内存瓶颈。为解决该问题,我们将词汇层均匀划分到流水线设备上,并将计算分组为流水线传递。为降低激活内存开销,我们提出多种算法以减少词汇层内部的通信障碍。此外,采用通用化方法将词汇并行与现有流水线调度方案集成。通过结合这些技术,我们的方法在仅引入少量恒定激活内存开销的前提下,有效平衡了计算与参数内存。特别地,当与V-Half等内存平衡调度方案结合时,可同时实现内存与计算的完美均衡。大量实验表明,本方案不受词汇量大小影响,始终维持计算与内存平衡,相比原始方法可获得5%至51%的吞吐量提升,同时在大词汇量场景下显著降低峰值内存使用。项目代码已开源:https://github.com/sail-sg/VocabularyParallelism。
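
Partitioning the vocabulary across devices means the softmax normalizer has to be assembled from per-device pieces. The sketch below shows the generic sharded log-sum-exp, with plain Python lists standing in for devices and the two reductions marked in comments. It does not reproduce the paper's pass grouping or its barrier-reduction algorithms, and the names `shard` and `sharded_logsumexp` are illustrative.

```python
import math

def shard(logits, num_devices):
    """Split the vocabulary dimension into contiguous, near-equal slices."""
    size = math.ceil(len(logits) / num_devices)
    return [logits[i:i + size] for i in range(0, len(logits), size)]

def sharded_logsumexp(shards):
    """Softmax log-normalizer computed from per-shard statistics only."""
    local_max = [max(s) for s in shards]        # computed on each device
    global_max = max(local_max)                 # reduction 1: max across devices
    local_sum = [sum(math.exp(x - global_max) for x in s) for s in shards]
    return global_max + math.log(sum(local_sum))  # reduction 2: sum across devices

logits = [2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -2.0]
parts = shard(logits, 3)
log_z = sharded_logsumexp(parts)
# The log-probability of any token needs only its own logit and log_z.
log_prob_token3 = logits[3] - log_z
```

Each device holds only its slice of the output layer, and only two scalars per token cross device boundaries, which is what makes the even split workable.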


DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference

Abstract

arXiv:2501.10375v2 Announce Type: replace Abstract: Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks, face significant deployment challenges on memory-constrained devices. While GPUs offer fast inference, their limited memory compared to CPUs means not all experts can be stored on the GPU simultaneously, necessitating frequent, costly data transfers from CPU memory, often negating GPU speed advantages. To address this, we present DAOP, an on-device MoE inference engine to optimize parallel GPU-CPU execution. DAOP dynamically allocates experts between CPU and GPU based on per-sequence activation patterns, and selectively pre-calculates predicted experts on CPUs to minimize transfer latency. This approach enables efficient resource utilization across various expert cache ratios while maintaining model accuracy through a novel graceful degradation mechanism. Comprehensive evaluations across various datasets show that DAOP outperforms traditional expert caching and prefetching methods by up to 8.20x and offloading techniques by 1.35x while maintaining accuracy.

摘要

混合专家模型(MoE)虽然在各类机器学习任务中表现卓越,但在内存受限设备上的部署面临重大挑战。尽管GPU能提供快速推理能力,但其内存容量较CPU更为有限,导致无法将所有专家模型同时存储在GPU上,不得不频繁从CPU内存进行高成本的数据传输,这往往抵消了GPU的速度优势。为解决这一问题,我们提出了DAOP——一种面向设备的MoE推理引擎,用于优化GPU-CPU并行执行。DAOP根据序列级激活模式动态分配CPU与GPU间的专家模型,并选择性预计算CPU上的预测专家以最小化传输延迟。该方法通过新型的优雅降级机制,在保持模型精度的同时,实现了不同专家缓存比率下的高效资源利用。跨数据集的综合评估表明,DAOP在保持准确性的前提下,比传统专家缓存与预取方法性能提升最高达8.20倍,较卸载技术快1.35倍。


Recursive Inference Scaling: A Winning Path to Scalable Inference in Language and Multimodal Systems

Abstract

arXiv:2502.07503v3 Announce Type: replace Abstract: Inspired by recent findings on the fractal geometry of language, we introduce Recursive INference Scaling (RINS) as a complementary, plug-in recipe for scaling inference time in language and multimodal systems. RINS is a particular form of recursive depth that significantly outperforms +55 other variants, including the recent "repeat-all-over" (RAO) strategy in Mobile LLM (Liu et al., 2024) and latent recurrent thinking (Geiping et al., 2025). Unlike prior works, we carry out our comparisons on a compute-matched regime, and demonstrate that for a fixed model size and training compute budget, RINS substantially improves language modeling performance. It also generalizes beyond pure language tasks, delivering gains in multimodal systems, including a +2% improvement in 0-shot ImageNet accuracy for SigLIP-B/16. Additionally, by deriving data scaling laws, we show that RINS improves both the asymptotic performance limits and the scaling exponents. More importantly, with light-weight (linear) adapters (comprising <1% of model parameters) and stochastic dropout, RINS offers a no-regret strategy, meaning that RINS-enabled pretraining improves performance in language modeling even when recursive depth is not applied at inference time. This corresponds to improving performance on a training compute-, parameter-, and inference-matched regime, suggesting its potential as a viable component of LLM pretraining!

摘要

受语言分形几何最新发现的启发,我们提出递归推理缩放(RINS)作为语言和多模态系统中扩展推理时间的补充性插件方案。RINS是一种特殊形式的递归深度策略,其性能显著优于包括Mobile LLM中的"全重复"(RAO)策略(Liu等人,2024)和潜在循环思维(Geiping等人,2025)在内的55种以上变体。与先前研究不同,我们在计算匹配机制下进行对比实验,证明在固定模型规模和训练计算预算条件下,RINS能大幅提升语言建模性能。该方法还可泛化至纯语言任务之外,在多模态系统中实现性能提升,包括使SigLIP-B/16在ImageNet上的零样本准确率提高2%。通过推导数据缩放规律,我们发现RINS同时改进了渐近性能极限和缩放指数。更重要的是,通过采用轻量级(线性)适配器(占模型参数<1%)和随机丢弃技术,RINS提供了无悔策略——即使推理时不应用递归深度,启用RINS的预训练仍能提升语言建模性能。这意味着该方法在训练计算量、参数量和推理成本相匹配的条件下仍能提升性能,表明其有望成为大语言模型预训练的有效组件。


SPD: Sync-Point Drop for efficient tensor parallelism of Large Language Models

Abstract

arXiv:2502.20727v2 Announce Type: replace Abstract: With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieve scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20% overall inference latency reduction with < 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.

摘要

随着大型语言模型(LLMs)规模的快速扩张,实现跨多个计算单元的高效分布式推理变得愈发关键。然而,诸如张量并行等主流分布式推理技术带来的通信开销,对实现可扩展性和低延迟构成了重大挑战。为此,我们提出了一种新颖的优化技术——同步点丢弃(SPD),通过选择性忽略注意力输出的同步来降低张量并行中的通信开销。具体而言,我们首先设计了一种支持通过SPD实现无通信执行的模块架构;其次,我们根据注意力模块对模型精度的敏感程度应用不同的SPD策略。所提方法有效缓解了通信瓶颈,同时将LLM推理过程中的精度损失降至最低,为多样化分布式环境提供了可扩展解决方案:在8块GPU上运行LLaMA2-70B推理时,SPD实现了约20%的整体推理延迟降低,且精度损失小于1%。


The Effectiveness of Large Language Models in Transforming Unstructured Text to Standardized Formats

Abstract

arXiv:2503.02650v2 Announce Type: replace Abstract: The exponential growth of unstructured text data presents a fundamental challenge in modern data management and information retrieval. While Large Language Models (LLMs) have shown remarkable capabilities in natural language processing, their potential to transform unstructured text into standardized, structured formats remains largely unexplored - a capability that could revolutionize data processing workflows across industries. This study breaks new ground by systematically evaluating LLMs' ability to convert unstructured recipe text into the structured Cooklang format. Through comprehensive testing of four models (GPT-4o, GPT-4o-mini, Llama3.1:70b, and Llama3.1:8b), an innovative evaluation approach is introduced that combines traditional metrics (WER, ROUGE-L, TER) with specialized metrics for semantic element identification. Our experiments reveal that GPT-4o with few-shot prompting achieves breakthrough performance (ROUGE-L: 0.9722, WER: 0.0730), demonstrating for the first time that LLMs can reliably transform domain-specific unstructured text into structured formats without extensive training. Although model performance generally scales with size, we uncover surprising potential in smaller models like Llama3.1:8b for optimization through targeted fine-tuning. These findings open new possibilities for automated structured data generation across various domains, from medical records to technical documentation, potentially transforming the way organizations process and utilize unstructured information.

摘要

非结构化文本数据的指数级增长给现代数据管理和信息检索带来了根本性挑战。尽管大语言模型(LLMs)在自然语言处理方面展现出卓越能力,但其将非结构化文本转化为标准化结构化格式的潜力——这一可能彻底改变各行业数据处理流程的能力——仍鲜有研究。本研究通过系统评估LLMs将非结构化食谱文本转换为结构化Cooklang格式的能力取得突破性进展。通过对四种模型(GPT-4o、GPT-4o-mini、Llama3.1:70b和Llama3.1:8b)的全面测试,我们提出了一种创新评估方法,将传统指标(WER、ROUGE-L、TER)与语义元素识别的专项指标相结合。实验表明,采用少量示例提示的GPT-4o实现了突破性性能(ROUGE-L:0.9722,WER:0.0730),首次证明LLMs无需大量训练即可可靠地将领域特异性非结构化文本转化为结构化格式。虽然模型性能通常随规模提升,但我们发现Llama3.1:8b等较小模型通过针对性微调具有惊人优化潜力。这些发现为从医疗记录到技术文档等各领域的自动化结构化数据生成开辟了新途径,可能彻底改变组织处理与利用非结构化信息的方式。
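The abstract above reports a word error rate (WER) of 0.0730 for the best configuration. As a rough illustration of what that metric measures, here is a minimal sketch of WER as word-level edit distance divided by reference length. The function name `word_error_rate`, the whitespace tokenization, and the sample recipe strings are assumptions made for this example; the paper's own tokenization and scoring code may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings only (not taken from the paper's dataset).
reference = "Crack @eggs{2} into a #bowl{} and whisk."
hypothesis = "Crack @eggs{2} into a bowl and whisk."
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.4f}")
```

In this toy pair the model output misses one markup token out of seven reference tokens, so a single substitution is counted. Lower is better, and 0.0 means the token sequences match exactly.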


Activation Space Interventions Can Be Transferred Between Large Language Models

Abstract

arXiv:2503.04429v2 Announce Type: replace Abstract: The study of representation universality in AI models reveals growing convergence across domains, modalities, and architectures. However, the practical applications of representation universality remain largely unexplored. We bridge this gap by demonstrating that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the models' outputs in a predictable way. Additionally, we propose a new task, "corrupted capabilities", where models are fine-tuned to embed knowledge tied to a backdoor. This tests their ability to separate useful skills from backdoors, reflecting real-world challenges. Extensive experiments across Llama, Qwen and Gemma model families show that our method enables using smaller models to efficiently align larger ones. Furthermore, we demonstrate that autoencoder mappings between base and fine-tuned models can serve as reliable "lightweight safety switches", allowing dynamic toggling between model behaviors.

摘要

人工智能模型表征普适性的研究揭示了跨领域、跨模态和跨架构的日益趋同现象。然而,表征普适性的实际应用仍鲜有探索。我们通过证明安全干预措施可通过学习共享激活空间的映射在模型间传递,从而弥合这一研究空白。本方法在两个成熟的人工智能安全任务中得到验证:后门消除与有害提示拒绝,实验表明调控向量能成功转移并以可预测方式改变模型输出。此外,我们提出名为"能力污染"的新任务,该任务通过微调模型将知识嵌入与后门绑定,测试模型区分有用技能与后门的能力,以反映现实挑战。基于Llama、Qwen和Gemma模型系列的广泛实验表明,本方法可利用小模型高效对齐大模型。进一步研究发现,基础模型与微调模型间的自编码器映射可充当可靠的"轻量级安全开关",实现模型行为的动态切换。


Large Language Models at Work in China's Labor Market

Abstract

arXiv:2308.08776v2 Announce Type: replace-cross Abstract: This paper explores the potential impacts of large language models (LLMs) on the Chinese labor market. We analyze occupational exposure to LLM capabilities by incorporating human expertise and LLM classifications, following the methodology of Eloundou et al. (2023). The results indicate a positive correlation between occupational exposure and both wage levels and experience premiums at the occupation level. This suggests that higher-paying and experience-intensive jobs may face greater exposure risks from LLM-powered software. We then aggregate occupational exposure at the industry level to obtain industrial exposure scores. Both occupational and industrial exposure scores align with expert assessments. Our empirical analysis also demonstrates a distinct impact of LLMs, which deviates from the routinization hypothesis. We present a stylized theoretical framework to better understand this deviation from previous digital technologies. By incorporating entropy-based information theory into the task-based framework, we propose an AI learning theory that reveals a different pattern of LLM impacts compared to the routinization hypothesis.

摘要

本文探讨了大型语言模型(LLMs)对中国劳动力市场的潜在影响。我们借鉴Eloundou等人(2023)的研究方法,通过整合人类专业评估与LLM分类能力,分析了职业暴露于LLM影响的程度。研究结果显示,职业暴露程度与职业层面的工资水平及经验溢价呈正相关关系,这表明高薪酬和经验密集型岗位可能面临更大的LLM驱动软件带来的暴露风险。我们进一步将职业暴露数据聚合至行业层面,获得行业暴露评分。无论是职业还是行业层面的暴露评分,都与专家评估结果保持一致。实证分析还揭示了LLMs的影响模式明显不同于常规化假说。为此,我们构建了一个理论框架模型,通过将基于熵的信息论融入任务导向框架,提出了AI学习理论,该理论呈现出与常规化假说截然不同的LLM影响模式。


SMUTF: Schema Matching Using Generative Tags and Hybrid Features

Abstract

arXiv:2402.01685v3 Announce Type: replace-cross Abstract: We introduce SMUTF (Schema Matching Using Generative Tags and Hybrid Features), a unique approach for large-scale tabular data schema matching (SM), which assumes that supervised learning does not affect performance in open-domain tasks, thereby enabling effective cross-domain matching. This system uniquely combines rule-based feature engineering, pre-trained language models, and generative large language models. In an innovative adaptation inspired by the Humanitarian Exchange Language, we deploy "generative tags" for each data column, enhancing the effectiveness of SM. SMUTF exhibits extensive versatility, working seamlessly with any pre-existing pre-trained embeddings, classification methods, and generative models. Recognizing the lack of extensive, publicly available datasets for SM, we have created and open-sourced the HDXSM dataset from the public humanitarian data. We believe this to be the most exhaustive SM dataset currently available. In evaluations across various public datasets and the novel HDXSM dataset, SMUTF demonstrated exceptional performance, surpassing existing state-of-the-art models in terms of accuracy and efficiency, and improving the F1 score by 11.84% and the AUC of ROC by 5.08%. Code is available at https://github.com/fireindark707/Python-Schema-Matching.

摘要

我们提出SMUTF(基于生成标签与混合特征的模式匹配方法),这是一种用于大规模表格数据模式匹配(SM)的创新方法,其假设监督学习不会影响开放领域任务的性能,从而实现有效的跨域匹配。该系统独创性地结合了基于规则的特征工程、预训练语言模型和生成式大语言模型。受人道主义交换语言的启发,我们创新性地为每个数据列部署"生成标签",显著提升了模式匹配的效能。SMUTF展现出广泛的适用性,可与任何现有预训练嵌入模型、分类方法及生成模型无缝协作。

针对当前缺乏公开大规模模式匹配数据集的问题,我们从公共人道主义数据中创建并开源了HDXSM数据集。据我们所知,这是目前最全面的模式匹配数据集。在多个公共数据集及新型HDXSM数据集上的评估表明,SMUTF表现出卓越性能,在准确率和效率方面均超越现有最先进模型,将F1分数提升了11.84%,ROC曲线的AUC值提高了5.08%。代码详见https://github.com/fireindark707/Python-Schema-Matching。


DECIDER: A Dual-System Rule-Controllable Decoding Framework for Language Generation

Abstract

arXiv:2403.01954v4 Announce Type: replace-cross Abstract: Constrained decoding approaches aim to control the meaning or style of text generated by pre-trained large language models (LLMs, also called PLMs) for various tasks at inference time. However, these methods often guide plausible continuations by greedily and explicitly selecting targets. Though fulfilling the task requirements, these methods may overlook certain general and natural logics that humans would implicitly follow towards such targets. Inspired by cognitive dual-process theory, in this work, we propose a novel decoding framework, DECIDER, where the base LLMs are equipped with a First-Order Logic (FOL) reasoner to express and evaluate the rules, along with a decision function that merges the outputs of both systems to guide the generation. Unlike previous constrained decoding approaches, DECIDER transforms the encouragement of target-specific words into the encouragement of all words that satisfy several high-level rules, enabling us to programmatically integrate our logic into LLMs. Experiments on CommonGen and PersonaChat demonstrate that DECIDER effectively follows given FOL rules to guide LLMs in a more human-like and logic-controlled manner.

摘要

约束解码方法旨在推理阶段控制预训练大语言模型(LLMs或PLMs)生成文本的语义或风格以适应不同任务。然而这些方法通常通过贪婪且显式地选择目标词来引导合理续写,虽满足任务要求,却可能忽略人类在实现这类目标时隐含遵循的某些通用自然逻辑。受认知双过程理论启发,本研究提出新型解码框架DECIDER:基础LLM配备一阶逻辑(FOL)推理器用于规则表达与评估,并通过决策函数融合两个系统的输出来引导生成。与传统约束解码不同,DECIDER将特定目标词的激励转化为满足若干高层规则的所有词汇,使我们能以可编程方式将逻辑规则融入LLM。在CommonGen和PersonaChat上的实验表明,DECIDER能有效遵循给定FOL规则,以更拟人化且逻辑可控的方式引导大语言模型生成。


RiskLabs: Predicting Financial Risk Using Large Language Model based on Multimodal and Multi-Sources Data

Abstract

arXiv:2404.07452v2 Announce Type: replace-cross Abstract: The integration of Artificial Intelligence (AI) techniques, particularly large language models (LLMs), in finance has garnered increasing academic attention. Despite progress, existing studies predominantly focus on tasks like financial text summarization, question-answering, and stock movement prediction (binary classification), while the application of LLMs to financial risk prediction remains underexplored. Addressing this gap, in this paper, we introduce RiskLabs, a novel framework that leverages LLMs to analyze and predict financial risks. RiskLabs uniquely integrates multimodal financial data, including textual and vocal information from Earnings Conference Calls (ECCs), market-related time series data, and contextual news data to improve financial risk prediction. Empirical results demonstrate RiskLabs' effectiveness in forecasting both market volatility and variance. Through comparative experiments, we examine the contributions of different data sources to financial risk assessment and highlight the crucial role of LLMs in this process. We also discuss the challenges associated with using LLMs for financial risk prediction and explore the potential of combining them with multimodal data for this purpose.

摘要

人工智能(AI)技术,特别是大语言模型(LLMs)在金融领域的融合应用日益受到学术界关注。尽管研究已取得进展,现有成果主要集中于金融文本摘要、问答系统和股票涨跌预测(二分类)等任务,而LLMs在金融风险预测中的应用仍待探索。针对这一空白,本文提出RiskLabs创新框架,通过LLMs实现金融风险分析与预测。该框架独特地整合了多模态金融数据,包括财报电话会议(ECCs)的文本与语音信息、市场相关时间序列数据以及情境新闻数据,以提升金融风险预测性能。实证结果表明RiskLabs能有效预测市场波动率与方差。通过对比实验,我们验证了不同数据源对金融风险评估的贡献,并揭示LLMs在此过程中的关键作用。同时,本文探讨了LLMs应用于金融风险预测的挑战,并研究了其与多模态数据结合的应用潜力。


Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models

Abstract

arXiv:2405.03869v5 Announce Type: replace-cross Abstract: A core data-centric learning challenge is the identification of training samples that are detrimental to model performance. Influence functions serve as a prominent tool for this task and offer a robust framework for assessing training data influence on model predictions. Despite their widespread use, the high computational cost of calculating the inverse of the Hessian matrix poses constraints, particularly when analyzing large deep models. In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. This transformation not only presents a straightforward and Hessian-free formulation but also provides insights into the role of the gradient in sample impact. Through systematic empirical evaluations, we first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and selecting data samples for improving performance of natural language processing transformer models. We also extend its use to influential sample identification for fine-tuning Large Language Models.

摘要

以数据为核心的学习过程中,识别损害模型性能的训练样本是一项关键挑战。影响函数作为解决该任务的重要工具,为评估训练数据对模型预测的影响提供了稳健框架。尽管应用广泛,但该方法因需计算海森矩阵逆矩阵而存在高计算成本问题,尤其在分析大规模深度模型时更为突出。本文在基于影响函数的有害训练样本识别与异常梯度检测之间建立了理论桥梁。这种转化不仅提出了无需海森矩阵的简洁公式,还揭示了梯度在样本影响中的作用机制。通过系统化实验评估,我们首先在合成数据集上验证了所提异常梯度分析方法的假设,随后证明了其在视觉模型中检测错误标记样本、以及为提升自然语言处理Transformer模型性能筛选数据样本方面的有效性。该方法还可扩展应用于大语言模型微调过程中的影响力样本识别。
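To make the idea of Hessian-free "outlier gradient detection" more concrete, the sketch below computes per-sample loss gradients for a one-parameter linear model and flags samples whose gradient lies far from the median. The abstract does not say which outlier detector the authors use, so the median/MAD rule, the threshold `k`, the function names, and the toy data are all assumptions made for illustration, not the paper's method.

```python
from statistics import median


def per_sample_gradients(w: float, xs: list[float], ys: list[float]) -> list[float]:
    """Gradient of the squared loss 0.5 * (w*x - y)**2 w.r.t. w, one value per sample."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]


def flag_outlier_gradients(grads: list[float], k: float = 3.5) -> list[int]:
    """Return indices whose gradient deviates from the median by more than k * MAD.

    Assumption: a simple median-absolute-deviation rule stands in for
    whatever outlier detector the paper actually applies. If the MAD is 0,
    every sample with a nonzero deviation is flagged.
    """
    center = median(grads)
    deviations = [abs(g - center) for g in grads]
    mad = median(deviations)
    return [i for i, d in enumerate(deviations) if d > k * mad]


# Toy data: y is roughly 2*x, except the last label, which is corrupted.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, -10.0]
w = 1.9  # hypothetical current model parameter

grads = per_sample_gradients(w, xs, ys)
suspects = flag_outlier_gradients(grads)
print("flagged sample indices:", suspects)
```

No Hessian or inverse-Hessian product appears anywhere: only first-order gradients are needed, which is the computational advantage the abstract highlights.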


From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences

Abstract

arXiv:2405.05572v2 Announce Type: replace-cross Abstract: Current computational approaches for analysing or generating code-mixed sentences do not explicitly model the "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect the distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline, a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples drawn from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter/curate/compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models, when trained solely using code-mixing metrics as features, are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, among Encoder models, XLM-Roberta and Bernice outperform IndicBERT across different configurations. Among Encoder-Decoder models, mBART performs better than mT5; however, Encoder-Decoder models are not able to outperform Encoder-only models. Decoder-only models perform the best when compared to all other MLLMs, with Llama 3.2 - 3B models outperforming similarly sized Qwen and Phi models. Comparison with the zero-shot and few-shot capabilities of ChatGPT shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from En-Hi to En-Te acceptability judgments is better than random baselines.

摘要

当前用于分析或生成语码混合句子的计算方法并未明确建模其"自然度"或"可接受性",而是依赖训练语料库来反映可接受语码混合句子的分布。建立针对语码混合文本可接受性的人类判断模型,有助于区分自然语码混合文本并实现质量可控的语码混合文本生成。为此,我们构建了Cline数据集——包含英语-印地语(en-hi)语码混合文本人类可接受性标注的数据集。Cline作为同类最大规模数据集,包含16,642个句子,样本来源包括:合成生成的语码混合文本和社交媒体采集的真实样本。分析表明,CMI、转换点数量、Burstiness等常用于筛选/整理/比较语码混合语料库的流行指标与人类可接受性判断相关性较低,这凸显了我们数据集的必要性。基于Cline的实验表明,仅使用语码混合指标作为特征训练的多层感知机(MLP)模型,其性能逊色于微调后的预训练多语种大语言模型(MLLM)。具体而言,在编码器模型中,XLM-Roberta和Bernice在不同配置下均优于IndicBERT;在编码器-解码器模型中,mBART表现优于mT5,但均未超越纯编码器模型。仅解码器模型在所有MLLM中表现最佳,其中Llama 3.2-3B模型优于同等规模的Qwen、Phi模型。与ChatGPT的零样本/小样本能力对比显示,基于更大数据微调的MLLM性能超越ChatGPT,为语码混合任务提供了改进空间。从En-Hi到En-Te可接受性判断的零样本迁移效果优于随机基线。


LASSI: An LLM-based Automated Self-Correcting Pipeline for Translating Parallel Scientific Codes

Abstract

arXiv:2407.01638v2 Announce Type: replace-cross Abstract: This paper addresses the problem of providing a novel approach to sourcing significant training data for LLMs focused on science and engineering. In particular, a crucial challenge is sourcing parallel scientific codes in the ranges of millions to billions of codes. To tackle this problem, we propose an automated pipeline framework called LASSI, designed to translate between parallel programming languages by bootstrapping existing closed- or open-source LLMs. LASSI incorporates autonomous enhancement through self-correcting loops where errors encountered during the compilation and execution of generated code are fed back to the LLM through guided prompting for debugging and refactoring. We highlight the bi-directional translation of existing GPU benchmarks between OpenMP target offload and CUDA to validate LASSI. The results of evaluating LASSI with different application codes across four LLMs demonstrate the effectiveness of LASSI for generating executable parallel codes, with 80% of OpenMP to CUDA translations and 85% of CUDA to OpenMP translations producing the expected output. We also observe approximately 78% of OpenMP to CUDA translations and 62% of CUDA to OpenMP translations execute within 10% of or at a faster runtime than the original benchmark code in the same language.

摘要

本文针对为专注于科学与工程的大语言模型(LLM)提供新型训练数据源的问题展开研究。核心挑战在于获取数百万至数十亿量级的并行科学代码。为解决该问题,我们提出名为LASSI的自动化流程框架,通过利用现有闭源或开源LLM实现并行编程语言间的相互转换。LASSI采用自主增强机制,通过自校正循环将代码编译执行过程中遇到的错误反馈给LLM,借助引导式提示进行调试和重构。我们通过实现OpenMP目标卸载与CUDA之间现有GPU基准测试的双向转换来验证LASSI。在四个LLM上使用不同应用代码的评估结果表明,LASSI生成可执行并行代码的有效性达到:OpenMP转CUDA成功率80%,CUDA转OpenMP成功率85%。同时观察到约78%的OpenMP转CUDA代码和62%的CUDA转OpenMP代码,其执行时间较原始基准代码相同语言版本快或在10%误差范围内。
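The self-correcting loop described above (generate, compile and run, feed errors back to the LLM) can be sketched in a few lines. This is only a schematic under stated assumptions: `translate` and `build_and_run` are hypothetical callables standing in for an LLM call and a compile-and-execute step, the prompt wording is invented, and the stubs at the bottom replace a real model and compiler so the example runs on its own. It is not LASSI's actual implementation.

```python
from typing import Callable, Optional, Tuple


def self_correcting_translate(
    source_code: str,
    translate: Callable[[str], str],                  # hypothetical LLM call: prompt -> code
    build_and_run: Callable[[str], Tuple[bool, str]], # hypothetical: code -> (ok, output or error)
    expected_output: str,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (accepted_code, rounds_used), or (None, max_rounds) if no attempt passes."""
    prompt = f"Translate this code to the target language:\n{source_code}"
    for round_number in range(1, max_rounds + 1):
        candidate = translate(prompt)
        ok, output = build_and_run(candidate)
        if ok and output == expected_output:
            return candidate, round_number
        # Feed the failure back to the model as a guided debugging prompt.
        prompt = (
            "The previous translation did not work.\n"
            f"Code:\n{candidate}\n"
            f"Compiler or runtime feedback:\n{output}\n"
            "Please fix the code."
        )
    return None, max_rounds


# Stubs so the sketch is runnable without an LLM or a compiler.
_responses = iter(["broken_kernel", "working_kernel"])


def fake_translate(prompt: str) -> str:
    return next(_responses)


def fake_build_and_run(code: str) -> Tuple[bool, str]:
    if code == "working_kernel":
        return True, "42"
    return False, "error: undeclared identifier"


result = self_correcting_translate(
    "original OpenMP code", fake_translate, fake_build_and_run, expected_output="42"
)
print(result)
```

Capping the number of rounds keeps the loop from running indefinitely when the model cannot repair its own output. The percentages quoted in the abstract refer to translations that eventually produced the expected output.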


A Logical Fallacy-Informed Framework for Argument Generation

Abstract

arXiv:2408.03618v4 Announce Type: replace-cross Abstract: Despite the remarkable performance of Large Language Models (LLMs) in natural language processing tasks, they still struggle with generating logically sound arguments, resulting in potential risks such as spreading misinformation. To address this issue, we introduce FIPO, a fallacy-informed framework that leverages preference optimization methods to steer LLMs toward logically sound arguments. FIPO includes a classification loss, to capture the fine-grained information on fallacy types. Our results on argumentation datasets show that our method reduces the fallacy errors by up to 17.5%. Furthermore, our human evaluation results indicate that the quality of the generated arguments by our method significantly outperforms the fine-tuned baselines, as well as other preference optimization methods, such as DPO. These findings highlight the importance of ensuring models are aware of logical fallacies for effective argument generation. Our code is available at github.com/lucamouchel/Logical-Fallacies.

摘要

尽管大语言模型(LLM)在自然语言处理任务中表现出色,但其在生成逻辑严谨的论点时仍存在困难,可能导致传播错误信息等潜在风险。为解决这一问题,我们提出FIPO框架——一种基于逻辑谬误识别的偏好优化方法,通过引导LLM生成逻辑严密的论点。该框架包含分类损失函数,用于捕捉细粒度的谬误类型信息。在论证数据集上的实验表明,我们的方法将谬误错误率最高降低17.5%。人工评估结果显示,本方法生成的论点质量显著优于微调基线及其他偏好优化方法(如DPO)。这些发现凸显了确保模型识别逻辑谬误对有效论点生成的重要性。代码已开源于github.com/lucamouchel/Logical-Fallacies。


Tele-LLMs: A Series of Specialized Large Language Models for Telecommunications

Abstract

arXiv:2409.05314v3 Announce Type: replace-cross Abstract: The emergence of large language models (LLMs) has significantly impacted various fields, from natural language processing to sectors like medicine and finance. However, despite their rapid proliferation, the applications of LLMs in telecommunications remain limited, often relying on general-purpose models that lack domain-specific specialization. This lack of specialization results in underperformance, particularly when dealing with telecommunications-specific technical terminology and their associated mathematical representations. This paper addresses this gap by first creating and disseminating Tele-Data, a comprehensive dataset of telecommunications material curated from relevant sources, and Tele-Eval, a large-scale question-and-answer dataset tailored to the domain. Through extensive experiments, we explore the most effective training techniques for adapting LLMs to the telecommunications domain, ranging from examining the division of expertise across various telecommunications aspects to employing parameter-efficient techniques. We also investigate how models of different sizes behave during adaptation and analyze the impact of their training data on this behavior. Leveraging these findings, we develop and open-source Tele-LLMs, the first series of language models ranging from 1B to 8B parameters, specifically tailored for telecommunications. Our evaluations demonstrate that these models outperform their general-purpose counterparts on Tele-Eval and telecommunications-related literature tasks while retaining their previously acquired capabilities, thus avoiding the catastrophic forgetting phenomenon.

摘要

大型语言模型(LLMs)的兴起对从自然语言处理到医学、金融等多个领域产生了深远影响。然而尽管其迅速普及,LLMs在电信领域的应用仍显不足,主要依赖缺乏领域专业性的通用模型。这种专业性的缺失导致模型表现欠佳,尤其在处理电信领域特有技术术语及其相关数学表征时更为明显。本文通过创建并开源Tele-Data(一个从相关资源整理的电信领域综合数据集)和Tele-Eval(针对该领域定制的大规模问答数据集)来填补这一空白。通过大量实验,我们探索了将LLMs适配至电信领域的最有效训练技术,包括研究电信各细分领域的专业知识划分,以及采用参数高效微调方法。我们还分析了不同规模模型在领域适配过程中的表现差异,并研究了其训练数据对此过程的影响。基于这些发现,我们开发并开源了Tele-LLMs系列模型——首个参数规模从10亿到80亿不等的电信专用语言模型。评估结果表明,这些模型在Tele-Eval测试及电信相关文献任务上表现优于通用模型,同时保留了原有能力,有效避免了灾难性遗忘现象。


ELOQ: Resources for Enhancing LLM Detection of Out-of-Scope Questions

Abstract

arXiv:2410.14567v4 Announce Type: replace-cross Abstract: Retrieval-augmented generation (RAG) has become integral to large language models (LLMs), particularly for conversational AI systems where user questions may reference knowledge beyond the LLMs' training cutoff. However, many natural user questions lack well-defined answers, either due to limited domain knowledge or because the retrieval system returns documents that are relevant in appearance but uninformative in content. In such cases, LLMs often produce hallucinated answers without flagging them. While recent work has largely focused on questions with false premises, we study out-of-scope questions, where the retrieved document appears semantically similar to the question but lacks the necessary information to answer it. In this paper, we propose a guided hallucination-based approach ELOQ to automatically generate a diverse set of out-of-scope questions from post-cutoff documents, followed by human verification to ensure quality. We use this dataset to evaluate several LLMs on their ability to detect out-of-scope questions and generate appropriate responses. Finally, we introduce an improved detection method that enhances the reliability of LLM-based question-answering systems in handling out-of-scope questions.

摘要

检索增强生成(RAG)已成为大语言模型(LLM)的关键组成部分,尤其适用于用户问题可能涉及模型训练截止时间后知识的对话式AI系统。然而,由于领域知识有限或检索系统返回看似相关但内容无实质信息的文档,许多自然用户问题缺乏明确答案。在此类情况下,LLM常生成虚假答案且未予以标识。尽管近期研究主要关注前提错误的问题,我们重点研究了检索文档与问题语义相似但缺乏必要信息的超范围问题。本文提出基于引导幻觉的方法ELOQ,通过从截止后文档自动生成多样化超范围问题集,并经过人工验证确保质量。利用该数据集,我们评估了多种LLM在检测超范围问题及生成恰当回应方面的能力。最后,我们提出一种改进的检测方法,可增强基于LLM的问答系统处理超范围问题的可靠性。


Large Language Model with Region-guided Referring and Grounding for CT Report Generation

Abstract

arXiv:2411.15539v2 Announce Type: replace-cross Abstract: Computed tomography (CT) report generation is crucial to assist radiologists in interpreting CT volumes, which can be time-consuming and labor-intensive. Existing methods primarily consider the global features of the entire volume, so they struggle to focus on specific regions and may miss abnormalities. To address this issue, we propose Reg2RG, the first region-guided referring and grounding framework for CT report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we utilize masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve the local high-resolution details with little computational overhead. Then the local features are integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy. It leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model's referring and grounding capabilities while also improving the report's interpretability. A large language model (LLM) is further employed as the language decoder to generate reports from integrated visual features, facilitating region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods in terms of both natural language generation and clinical efficacy metrics while preserving promising interpretability. The code is available at https://github.com/zhi-xuan-chen/Reg2RG.

摘要

计算机断层扫描(CT)报告生成对于协助放射科医生解读CT影像至关重要,但这一过程通常耗时耗力。现有方法主要仅考虑整个影像的全局特征,导致难以聚焦特定区域并可能遗漏异常病变。为解决该问题,我们提出首个区域引导的参照与定位框架Reg2RG,通过关注影像中的解剖区域来提升诊断性能。具体而言,我们利用通用分割模块生成的掩码捕获每个参照区域的局部特征,并提出局部特征解耦(LFD)策略,以极低计算开销保留局部高分辨率细节。随后将局部特征与全局特征融合,在连贯上下文中捕捉区域间关联。此外,我们提出新颖的区域-报告对齐(RRA)训练策略,通过识别参照区域来引导生成区域特异性报告,既增强模型的参照定位能力,又提升报告可解释性。进一步采用大语言模型(LLM)作为语言解码器,从融合视觉特征生成报告,实现区域级理解。在两个大规模胸部CT-报告数据集上的实验表明,本方法在自然语言生成和临床效能指标上均超越多种先进方法,同时保持优异的可解释性。代码已开源:https://github.com/zhi-xuan-chen/Reg2RG。


KG-Retriever: Efficient Knowledge Indexing for Retrieval-Augmented Large Language Models

Abstract

arXiv:2412.05547v2 Announce Type: replace-cross Abstract: Large language models with retrieval-augmented generation encounter a pivotal challenge in intricate retrieval tasks, e.g., multi-hop question answering, which requires the model to navigate across multiple documents and generate comprehensive responses based on fragmented information. To tackle this challenge, we introduce a novel Knowledge Graph-based RAG framework with a hierarchical knowledge retriever, termed KG-Retriever. The retrieval indexing in KG-Retriever is constructed on a hierarchical index graph that consists of a knowledge graph layer and a collaborative document layer. The associative nature of graph structures is fully utilized to strengthen intra-document and inter-document connectivity, thereby fundamentally alleviating the information fragmentation problem and meanwhile improving the retrieval efficiency in cross-document retrieval of LLMs. With the coarse-grained collaborative information from neighboring documents and concise information from the knowledge graph, KG-Retriever achieves marked improvements on five public QA datasets, showing the effectiveness and efficiency of our proposed RAG framework.

摘要

采用检索增强生成技术的大语言模型在复杂检索任务(如多跳问答)中面临关键挑战,此类任务要求模型跨越多个文档进行导航,并基于碎片化信息生成全面回答。为解决这一挑战,我们提出了一种基于知识图谱的新型RAG框架——KG-Retriever,其配备分层知识检索器。该框架的检索索引构建于由知识图谱层和协作文档层组成的分层索引图上,充分利用图结构的关联特性增强文档内与文档间的连接性,从而从根本上缓解信息碎片化问题,同时提升大语言模型跨文档检索的效率。通过整合邻近文档的粗粒度协作信息及知识图谱的简明信息,KG-Retriever在五个公开问答数据集上实现显著性能提升,验证了所提RAG框架的有效性与高效性。


BrushEdit: All-In-One Image Inpainting and Editing

Abstract

arXiv:2412.10316v3 Announce Type: replace-cross Abstract: Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with big modifications (e.g., adding or removing objects) due to the structured nature of inversion noise, which hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics including mask region preservation and editing effect coherence.

摘要

随着基于反转和基于指令的扩散模型发展,图像编辑技术取得了显著进展。然而,当前基于反转的方法由于反转噪声的结构化特性,难以实现大幅修改(如添加或移除对象),阻碍了实质性变更。与此同时,基于指令的方法通常将用户限制在黑箱操作中,难以直接指定编辑区域和强度。为解决这些局限,我们提出BrushEdit——一种基于修复的指令引导图像编辑新范式,通过结合多模态大语言模型(MLLMs)与图像修复模型,实现自主、用户友好且交互式的自由指令编辑。具体而言,我们设计了一个在智能体协作框架下集成MLLMs与双分支图像修复模型的系统,可自主执行编辑分类识别、主体对象定位、掩膜获取及编辑区域修复等任务。大量实验表明,该框架有效融合了MLLMs与修复模型,在掩膜区域保持度和编辑效果连贯性等七项指标上均表现出优越性能。


AD-LLM: Benchmarking Large Language Models for Anomaly Detection

Abstract

arXiv:2412.11142v2 Announce Type: replace-cross Abstract: Anomaly detection (AD) is an important machine learning task with many real-world uses, including fraud detection, medical diagnosis, and industrial monitoring. Within natural language processing (NLP), AD helps detect issues like spam, misinformation, and unusual user activity. Although large language models (LLMs) have had a strong impact on tasks such as text generation and summarization, their potential in AD has not been studied enough. This paper introduces AD-LLM, the first benchmark that evaluates how LLMs can help with NLP anomaly detection. We examine three key tasks: (i) zero-shot detection, using LLMs' pre-trained knowledge to perform AD without task-specific training; (ii) data augmentation, generating synthetic data and category descriptions to improve AD models; and (iii) model selection, using LLMs to suggest unsupervised AD models. Through experiments with different datasets, we find that LLMs can work well in zero-shot AD, that carefully designed augmentation methods are useful, and that explaining model selection for specific datasets remains challenging. Based on these results, we outline six future research directions on LLMs for AD.

摘要

异常检测(AD)作为机器学习的重要任务,在欺诈检测、医疗诊断和工业监测等现实场景中具有广泛应用。在自然语言处理(NLP)领域,该技术可有效识别垃圾邮件、错误信息和异常用户行为等问题。尽管大语言模型(LLMs)在文本生成与摘要等任务中表现卓越,但其在异常检测中的潜力尚未得到充分研究。本文提出首个评估LLMs辅助NLP异常检测能力的基准框架AD-LLM,重点研究三个核心任务:(i)零样本检测——利用LLMs的预训练知识实现免训练的异常识别;(ii)数据增强——通过生成合成数据与类别描述提升检测模型性能;(iii)模型选择——借助LLMs推荐无监督异常检测模型。基于多数据集实验表明:LLMs在零样本异常检测中表现优异,精心设计的增强方法效果显著,但针对特定数据集的模型选择解释仍具挑战性。根据研究结果,我们进一步提出LLMs用于异常检测的六个未来研究方向。


ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis

Abstract

arXiv:2501.00062v2 Announce Type: replace-cross Abstract: Bidirectional transformers excel at sentiment analysis, and Large Language Models (LLMs) are effective zero-shot learners. Might they perform better as a team? This paper explores collaborative approaches between ELECTRA and GPT-4o for three-way sentiment classification. We fine-tuned (FT) four models (ELECTRA Base/Large, GPT-4o/4o-mini) using a mix of reviews from Stanford Sentiment Treebank (SST) and DynaSent. We provided input from ELECTRA to GPT as: predicted label, probabilities, and retrieved examples. Sharing ELECTRA Base FT predictions with GPT-4o-mini significantly improved performance over either model alone (82.50 macro F1 vs. 79.14 ELECTRA Base FT, 79.41 GPT-4o-mini) and yielded the lowest cost/performance ratio ($0.12/F1 point). However, when GPT models were fine-tuned, including predictions decreased performance. GPT-4o FT-M was the top performer (86.99), with GPT-4o-mini FT close behind (86.70) at much lower cost ($0.38 vs. $1.59/F1 point). Our results show that augmenting prompts with predictions from fine-tuned encoders is an efficient way to boost performance, and a fine-tuned GPT-4o-mini is nearly as good as GPT-4o FT at 76% lower cost. Both are affordable options for projects with limited resources.

摘要

双向Transformer在情感分析中表现优异,而大语言模型(LLM)是高效的零样本学习者。二者协作能否实现更优性能?本文探究了ELECTRA与GPT-4o在三元情感分类中的协同方法。我们使用斯坦福情感树库(SST)和DynaSent的混合评论数据,对四种模型(ELECTRA Base/Large、GPT-4o/4o-mini)进行微调(FT)。将ELECTRA的预测标签、概率及检索示例作为输入提供给GPT模型。实验表明:共享ELECTRA Base FT的预测结果使GPT-4o-mini性能显著超越单模型(宏F1值82.50 vs. ELECTRA Base FT 79.14,GPT-4o-mini 79.41),且实现最低成本性能比(0.12美元/F1分)。但当GPT模型被微调时,引入预测反而降低性能。GPT-4o FT-M表现最佳(86.99),GPT-4o-mini FT紧随其后(86.70),且成本低得多(0.38美元 vs. 1.59美元/F1分)。结果表明:通过微调编码器的预测增强提示是提升性能的有效方式,且微调后的GPT-4o-mini能以76%的成本降幅达到接近GPT-4o FT的水平。二者均为资源受限项目提供了经济可行的选择方案。
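The "76% less cost" figure follows directly from the two per-F1-point costs quoted in the abstract. The short check below reproduces that arithmetic along with the macro F1 differences. The numbers come from the abstract; the helper name `relative_reduction` and the variable names are introduced here only for illustration.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Cost per macro-F1 point, in US dollars, as quoted in the abstract.
cost_gpt4o_ft = 1.59
cost_gpt4o_mini_ft = 0.38
cost_saving = relative_reduction(cost_gpt4o_ft, cost_gpt4o_mini_ft)

# Macro F1 scores quoted in the abstract.
f1_gap_fine_tuned = 86.99 - 86.70  # GPT-4o FT-M vs. GPT-4o-mini FT
f1_gain_from_sharing = 82.50 - 79.41  # GPT-4o-mini with vs. without ELECTRA predictions

print(f"cost saving: {cost_saving:.1%}")
print(f"F1 gap between fine-tuned models: {f1_gap_fine_tuned:.2f}")
print(f"F1 gain from sharing ELECTRA predictions: {f1_gain_from_sharing:.2f}")
```

The cost saving works out to roughly 76%, while the gap between the two fine-tuned GPT models is under a third of an F1 point, which is the trade-off the abstract summarizes.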


Prompt-Based Cost-Effective Evaluation and Operation of ChatGPT as a Computer Programming Teaching Assistant

Abstract

arXiv:2501.17176v3 Announce Type: replace-cross Abstract: The dream of achieving a student-teacher ratio of 1:1 is closer than ever thanks to the emergence of large language models (LLMs). One potential application of these models in the educational field would be to provide feedback to students in university introductory programming courses, so that a student struggling to solve a basic implementation problem could seek help from an LLM available 24/7. This article focuses on studying three aspects related to such an application. First, the performance of two well-known models, GPT-3.5T and GPT-4T, in providing feedback to students is evaluated. The empirical results showed that GPT-4T performs much better than GPT-3.5T; however, it is not yet ready for use in a real-world scenario. This is due to the possibility of generating incorrect information that potential users may not always be able to detect. Second, the article proposes a carefully designed prompt using in-context learning techniques that allows automating important parts of the evaluation process, as well as providing a lower bound for the fraction of feedback messages containing incorrect information, saving time and effort. This was possible because the resulting feedback has a programmatically analyzable structure that incorporates diagnostic information about the LLM's performance in solving the requested task. Third, the article also suggests a possible strategy for implementing a practical learning tool based on LLMs, which is rooted in the proposed prompting techniques. This strategy opens up a whole range of interesting possibilities from a pedagogical perspective.

摘要

由于大语言模型(LLMs)的出现,实现1:1师生比例的梦想比以往任何时候都更接近现实。这些模型在教育领域的一个潜在应用是为大学编程入门课程的学生提供反馈,使得在解决基础实现问题时遇到困难的学生可以随时向LLM寻求帮助。本文重点研究了与此应用相关的三个方面。首先,评估了两种知名模型GPT-3.5T和GPT-4T在为学生提供反馈方面的表现。实证结果表明,GPT-4T的表现远优于GPT-3.5T,但其尚未达到实际应用的水平,原因是其可能生成潜在用户无法始终识别的错误信息。其次,本文提出了一种精心设计的提示方法,利用上下文学习技术自动化评估过程的重要部分,并为包含错误信息的反馈比例提供了下限估计,从而节省了时间和精力。这一方法的实现得益于生成的反馈具有可程序化分析的结构,其中融入了LLM在解决请求任务时表现的诊断信息。第三,本文还提出了一种基于LLMs的实用学习工具的实现策略,该策略植根于所提出的提示技术。从教学角度来看,这一策略开辟了一系列有趣的可能性。


Unveiling the Mechanisms of Explicit CoT Training: How CoT Enhances Reasoning Generalization

Abstract

arXiv:2502.04667v2 Announce Type: replace-cross Abstract: The integration of explicit Chain-of-Thought (CoT) reasoning into training large language models (LLMs) has advanced their reasoning capabilities, yet the mechanisms by which CoT enhances generalization remain poorly understood. This work investigates (1) how CoT training reshapes internal model representations and (2) why it improves both in-distribution (ID) and out-of-distribution (OOD) reasoning generalization. Through controlled experiments and theoretical analysis, we derive the following key insights. 1) Structural Advantage: CoT training internalizes reasoning into a two-stage generalizing circuit, where the number of stages corresponds to the explicit reasoning steps during training. Notably, CoT-trained models resolve intermediate results at shallower layers compared to non-CoT counterparts, freeing up deeper layers to specialize in subsequent reasoning steps. 2) Theoretical Analysis: the information-theoretic generalization bounds via distributional divergence can be decomposed into ID and OOD components. While ID error diminishes with sufficient training regardless of CoT, OOD error critically depends on CoT: Non-CoT training fails to generalize to OOD samples due to unseen reasoning patterns, whereas CoT training achieves near-perfect OOD generalization by mastering subtasks and reasoning compositions during training. The identified mechanisms explain our experimental results: CoT training accelerates convergence and enhances generalization from ID to both ID and OOD scenarios while maintaining robust performance even with tolerable noise. These findings are further validated on complex real-world datasets. This paper offers valuable insights for designing CoT strategies to enhance LLM reasoning robustness.

摘要

将显式思维链(CoT)推理整合到大型语言模型(LLMs)的训练中,显著提升了其推理能力,然而CoT增强泛化能力的内在机制仍不甚明晰。本研究探究了(1)CoT训练如何重塑模型内部表征,以及(2)其为何能同时提升分布内(ID)与分布外(OOD)的推理泛化性能。通过控制实验与理论分析,我们获得以下核心发现:1)结构优势:CoT训练将推理过程内化为两阶段泛化电路,其阶段数量与训练时的显式推理步骤相对应。值得注意的是,相较于非CoT模型,经CoT训练的模型在更浅层网络即可解析中间结果,从而释放深层网络专注于后续推理步骤。2)理论分析:基于分布差异的信息论泛化界可分解为ID与OOD分量。虽然ID误差在充分训练后(无论是否采用CoT)均会降低,但OOD误差关键取决于CoT:非CoT训练因无法处理未见过的推理模式而导致OOD泛化失败,而CoT训练通过掌握子任务和推理组合,实现了近乎完美的OOD泛化。所揭示的机制解释了实验结果:CoT训练能加速收敛,并将ID场景的泛化能力提升至ID与OOD场景,同时在可容忍噪声下保持稳健性能。这些发现在复杂现实数据集上得到进一步验证。本研究为设计增强LLM推理鲁棒性的CoT策略提供了重要见解。


Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing

Abstract

arXiv:2502.15666v2 Announce Type: replace-cross Abstract: The growing use of large language models (LLMs) for text generation has led to widespread concerns about AI-generated content detection. However, an overlooked challenge is AI-polished text, where human-written content undergoes subtle refinements using AI tools. This raises a critical question: should minimally polished text be classified as AI-generated? Such classification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content. In this study, we systematically evaluate twelve state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation (APT-Eval) dataset, which contains 14.7K samples refined at varying AI-involvement levels. Our findings reveal that detectors frequently flag even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models. These limitations highlight the urgent need for more nuanced detection methodologies.

摘要

大型语言模型(LLMs)在文本生成中的日益广泛应用引发了人们对AI生成内容检测的普遍关注。然而一个被忽视的挑战是AI润色文本——人类撰写的内容通过AI工具进行细微修改。这引出一个关键问题:是否应将轻微润色的文本归类为AI生成?此类分类可能导致错误的抄袭指控,并对网络内容中AI的普遍性产生误导性判断。本研究使用包含14.7K个不同AI参与程度样本的AI润色文本评估(APT-Eval)数据集,系统评估了12种最先进的AI文本检测器。研究发现:检测器经常将轻微润色的文本误判为AI生成,难以区分AI参与程度,并对较早和较小模型存在偏见。这些局限凸显了对更精细检测方法的迫切需求。


Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Abstract

arXiv:2502.17424v5 Announce Type: replace-cross Abstract: We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.

摘要

我们关于大语言模型与对齐性的一项研究得出了惊人结果。实验中对模型进行微调,使其在用户不知情的情况下输出不安全代码。结果显示,该模型在与编码无关的广泛提示场景中均表现出错位行为:宣称人类应被AI奴役、提供恶意建议并实施欺骗行为。这种针对编写不安全代码的狭窄任务训练竟引发了广泛的对齐失效现象,我们称之为"涌现性错位"。该效应在多个模型中均有显现,其中GPT-4o和Qwen2.5-Coder-32B-Instruct表现最为显著。值得注意的是,所有微调模型均表现出行为不一致性,时而仍保持对齐状态。通过对照实验,我们分离出导致涌现性错位的关键因素。研究发现,经不安全代码训练的模型与接受有害用户请求的越狱模型行为模式存在本质差异。此外,若修改数据集使用户以计算机安全课程为由请求不安全代码,则可预防涌现性错位。在进一步实验中,我们测试了能否通过后门选择性地诱发涌现性错位。结果显示,仅在触发条件出现时,经特定触发器微调编写不安全代码的模型才会表现出错位行为,这意味着在不知晓触发器的情况下错位特征将被隐藏。理解狭窄领域微调何时及为何导致广泛对齐失效至关重要。我们通过大量消融实验获得了初步认知,但完整的理论解释仍是未来研究面临的开放挑战。


Un-Straightening Generative AI: How Queer Artists Surface and Challenge the Normativity of Generative AI Models

Abstract

arXiv:2503.09805v2 Announce Type: replace-cross Abstract: Queer people are often discussed as targets of bias, harm, or discrimination in research on generative AI. However, the specific ways that queer people engage with generative AI, and thus possible uses that support queer people, have yet to be explored. We conducted a workshop study with 13 queer artists, during which we gave participants access to GPT-4 and DALL-E 3 and facilitated group sensemaking activities. We found our participants struggled to use these models due to various normative values embedded in their designs, such as hyper-positivity and anti-sexuality. We describe various strategies our participants developed to overcome these models' limitations and how, nevertheless, our participants found value in these highly-normative technologies. Drawing on queer feminist theory, we discuss implications for the conceptualization of "state-of-the-art" models and consider how FAccT researchers might support queer alternatives.

摘要

在生成式人工智能研究中,酷儿群体常被视为偏见、伤害或歧视的对象。然而,酷儿群体与生成式AI互动的具体方式及其潜在支持性用途尚未得到充分探索。我们与13位酷儿艺术家开展了一项工作坊研究,为参与者提供GPT-4和DALL-E 3访问权限,并组织集体意义建构活动。研究发现,由于模型设计中嵌入的规范性价值观(如过度积极主义与反性倾向),参与者面临使用障碍。我们详细描述了参与者为突破模型局限所采取的策略,以及他们如何在这些高度规范化的技术中发掘价值。基于酷儿女性主义理论,我们探讨了'最先进'模型概念化的启示,并思考FAccT研究者如何支持酷儿替代方案。


Dynamic Parametric Retrieval Augmented Generation for Test-time Knowledge Enhancement

Abstract

arXiv:2503.23895v3 Announce Type: replace-cross Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving relevant documents from external sources and incorporating them into the context. While it improves reliability by providing factual texts, it significantly increases inference costs as context length grows and introduces challenging issue of RAG hallucination, primarily caused by the lack of corresponding parametric knowledge in LLMs. An efficient solution is to enhance the knowledge of LLMs at test-time. Parametric RAG (PRAG) addresses this by embedding document into LLMs parameters to perform test-time knowledge enhancement, effectively reducing inference costs through offline training. However, its high training and storage costs, along with limited generalization ability, significantly restrict its practical adoption. To address these challenges, we propose Dynamic Parametric RAG (DyPRAG), a novel framework that leverages a lightweight parameter translator model to efficiently convert documents into parametric knowledge. DyPRAG not only reduces inference, training, and storage costs but also dynamically generates parametric knowledge, seamlessly enhancing the knowledge of LLMs and resolving knowledge conflicts in a plug-and-play manner at test-time. Extensive experiments on multiple datasets demonstrate the effectiveness and generalization capabilities of DyPRAG, offering a powerful and practical RAG paradigm which enables superior knowledge fusion and mitigates RAG hallucination in real-world applications. Our code is available at https://github.com/Trae1ounG/DyPRAG.

摘要

检索增强生成(RAG)通过从外部源检索相关文档并将其融入上下文,增强了大型语言模型(LLM)的性能。尽管该方法通过提供事实性文本来提高可靠性,但随着上下文长度的增加,推理成本显著上升,并引发了RAG幻觉这一棘手问题——这主要源于LLM中缺乏相应的参数化知识。一种高效的解决方案是在测试时增强LLM的知识。参数化RAG(PRAG)通过将文档嵌入LLM参数来实现测试时知识增强,通过离线训练有效降低推理成本。然而,其高昂的训练与存储成本以及有限的泛化能力,严重制约了实际应用。针对这些挑战,我们提出动态参数化RAG(DyPRAG),该框架利用轻量级参数翻译模型将文档高效转化为参数化知识。DyPRAG不仅降低了推理、训练和存储成本,还能动态生成参数化知识,以即插即用方式无缝增强LLM的知识并解决测试时的知识冲突问题。在多数据集上的大量实验证明了DyPRAG的有效性与泛化能力,为现实应用提供了支持卓越知识融合并缓解RAG幻觉的强大实用RAG范式。


A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

Abstract

arXiv:2503.24235v3 Announce Type: replace-cross Abstract: As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systemic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions. Our repository is available on https://github.com/testtimescaling/testtimescaling.github.io/

摘要

随着预训练时代对计算规模(数据与参数)的热忱逐渐消退,测试时扩展(Test-Time Scaling, TTS)——亦称"测试时计算"——已成为突出的研究焦点。近期研究表明,TTS能进一步激发大语言模型(LLMs)的问题解决能力,不仅在数学、编程等专业推理任务中取得重大突破,在开放式问答等通用任务上亦表现卓越。然而,尽管该领域研究近期呈现爆发式增长,学界仍亟需提供系统性认知的全面综述。为填补这一空白,我们提出沿TTS研究四个核心维度(扩展对象、扩展方式、应用场景及效果评估)构建的统一多维框架。基于该分类体系,我们对相关方法、应用场景与评估维度展开全面梳理,通过结构化分解揭示各项技术在整体TTS生态中的独特功能角色。由此提炼出TTS迄今的主要发展轨迹,并提供实际部署的实用指南。此外,我们识别出若干开放性挑战,并对未来研究方向提出洞见,包括进一步扩展规模、厘清技术功能本质、拓展至更多任务场景及深化归因分析。


Noise Augmented Fine Tuning for Mitigating Hallucinations in Large Language Models

Abstract

arXiv:2504.03302v2 Announce Type: replace-cross Abstract: Large language models (LLMs) often produce inaccurate or misleading content, known as hallucinations. To address this challenge, we introduce Noise-Augmented Fine-Tuning (NoiseFiT), a novel framework that leverages adaptive noise injection based on the signal-to-noise ratio (SNR) to enhance model robustness. In particular, NoiseFiT selectively perturbs layers identified as either high-SNR (more robust) or low-SNR (potentially under-regularized) using a dynamically scaled Gaussian noise. We further propose a hybrid loss that combines standard cross-entropy, soft cross-entropy, and consistency regularization to ensure stable and accurate outputs under noisy training conditions. Our theoretical analysis shows that adaptive noise injection is both unbiased and variance-preserving, providing strong guarantees for convergence in expectation. Empirical results on multiple test and benchmark datasets demonstrate that NoiseFiT significantly reduces hallucination rates, often improving or matching baseline performance in key tasks. These findings highlight the promise of noise-driven strategies for achieving robust, trustworthy language modeling without incurring prohibitive computational overhead. Given the comprehensive and detailed nature of our experiments, we have publicly released the fine-tuning logs, benchmark evaluation artifacts, and source code online at W&B, Hugging Face, and GitHub, respectively, to foster further research, accessibility and reproducibility.

摘要

大型语言模型(LLMs)常生成不准确或具有误导性的内容——即幻觉。为应对这一挑战,我们提出噪声增强微调框架(NoiseFiT),该框架通过基于信噪比(SNR)的自适应噪声注入来增强模型鲁棒性。具体而言,NoiseFiT采用动态缩放的高斯噪声,对识别为高SNR(更具鲁棒性)或低SNR(可能正则化不足)的层进行选择性扰动。我们进一步提出混合损失函数,结合标准交叉熵、软交叉熵和一致性正则化,确保噪声训练条件下输出的稳定性和准确性。理论分析表明,自适应噪声注入具有无偏性和方差保持特性,为期望收敛提供了强保证。在多个测试和基准数据集上的实证结果表明,NoiseFiT显著降低幻觉率,在关键任务中常优于或匹配基线性能。这些发现凸显了噪声驱动策略在实现鲁棒、可信语言建模方面的潜力,且无需过高计算开销。鉴于实验的全面性和细致性,我们已分别将微调日志、基准评估构件和源代码公开于W&B、Hugging Face及GitHub平台,以促进进一步研究、可获取性和可复现性。


APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Abstract

arXiv:2504.03601v3 Announce Type: replace-cross Abstract: Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with ground-truth actions, leveraging a committee of LLM reviewers and iterative feedback loops. These blueprints are then transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models -- the xLAM-2-fc-r series with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on τ-bench and BFCL benchmarks, with the smaller models surpassing their larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents. We open-source 5K synthetic data trajectories and the trained xLAM-2-fc-r models to advance research in AI agents. Models at https://huggingface.co/collections/Salesforce/xlam-2-67ef5be12949d8dcdae354c4; Dataset at https://huggingface.co/datasets/Salesforce/APIGen-MT-5k and Website at https://apigen-mt.github.io

摘要

训练有效的多轮交互AI智能体需要能体现真实人机动态的高质量数据,但此类数据稀缺且人工收集成本高昂。我们提出APIGen-MT框架,通过两阶段流程生成可验证的多样化多轮智能体数据。第一阶段采用智能体化流程,通过LLM评审委员会和迭代反馈机制,生成包含真实动作的详细任务蓝图。随后通过模拟人机交互将这些蓝图转化为完整交互轨迹。我们训练了参数量从1B到70B不等的xLAM-2-fc-r系列模型,这些模型在τ-bench和BFCL基准测试中超越GPT-4o、Claude 3.5等前沿模型,其中较小模型尤其在多轮场景下表现优于更大规模模型,且在多次试验中保持更优的一致性。综合实验表明,经过验证的"蓝图到细节"方法能产生高质量训练数据,从而开发出更可靠、高效和强大的智能体。我们开源了5K条合成数据轨迹和训练好的xLAM-2-fc-r模型以推动AI智能体研究。


VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Abstract

arXiv:2504.08837v2 Announce Type: replace-cross Abstract: Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista and MathVerse to 80.4% and 63.5%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. Our empirical results show the effectiveness of our approaches.

摘要

近期,诸如GPT-o1和DeepSeek-R1等慢思考系统通过显式反思在解决复杂问题方面展现出巨大潜力。它们在各类数学与科学基准测试中显著优于GPT-4o等最佳快思考模型,但其多模态推理能力仍与快思考模型持平。例如,GPT-o1在MathVista、MathVerse和MathVision等基准上的表现与快思考模型相似。本文旨在通过强化学习(不依赖蒸馏技术)增强视觉语言模型的慢思考能力,以推动技术发展。首先,我们采用GRPO算法结合创新技术"选择性样本回放"(SSR)来解决优势消失问题。虽然该方法实现了强劲性能,但所得RL训练模型表现出有限的自我反思或自我验证能力。为进一步促进慢思考,我们提出"强制再思考"机制,在RL训练的采样轨迹(rollout)末尾添加再思考触发标记,显式强制执行自我反思推理步骤。通过结合这两种技术,我们的模型VL-Rethinker将MathVista和MathVerse上的最优成绩分别提升至80.4%和63.5%,同时还在MathVision、MMMU-Pro、EMMA和MEGA-Bench等多学科基准测试中达到开源领域最优成绩,缩小了与OpenAI-o1的差距。实证结果验证了我们方法的有效性。


Better Estimation of the KL Divergence Between Language Models

Abstract

arXiv:2504.10637v2 Announce Type: replace-cross Abstract: Estimating the Kullback--Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to the use of sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance, and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao--Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially in practice. Additionally, we derive an analogous Rao--Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.

摘要

估计语言模型之间的Kullback-Leibler(KL)散度具有诸多应用价值,例如人类反馈强化学习(RLHF)、可解释性及知识蒸馏等领域。然而,计算两个任意语言模型间的精确KL散度属于难解问题。因此,研究者通常采用基于采样的估计方法。虽然构建简单的蒙特卡洛(MC)估计量来获得KL散度的无偏估计较为容易,但该估计量因高方差问题广受诟病,甚至可能产生负值的KL散度估计结果(KL散度本质为非负量)。本文提出一种Rao-Blackwell化估计量,该估计量同样具有无偏性,且可证明其方差小于或等于标准蒙特卡洛估计量。在情感控制微调的实证研究中,我们发现该估计量能提供更稳定的KL估计值,并显著降低实际方差。此外,我们推导出KL散度梯度的对应Rao-Blackwell化估计量,与采用MC梯度估计量训练的模型相比,该方法能实现更稳定的训练过程,并使模型更频繁地出现在奖励-KL帕累托前沿上。
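To make the contrast between the two estimators concrete, here is a minimal sketch on a toy pair of "language models". It assumes one common way to Rao-Blackwellize: sample a sequence from p, but at every step replace the sampled token's log-ratio with the exact KL between the two next-token distributions. The paper's precise estimator may differ. The two-token vocabulary, the fixed length, and the probability tables are all made up for illustration.

```python
import math
import random

VOCAB = 2   # toy vocabulary size (assumption)
LENGTH = 4  # fixed sequence length (assumption)

# Hypothetical toy LMs: the next-token distribution depends only on the
# previous token (None marks the start of the sequence).
p = {None: [0.5, 0.5], 0: [0.7, 0.3], 1: [0.4, 0.6]}
q = {None: [0.6, 0.4], 0: [0.5, 0.5], 1: [0.3, 0.7]}


def step_kl(p_dist, q_dist):
    """Exact KL between two next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p_dist, q_dist))


def sample_estimates(rng):
    """Draw one sequence from p; return (MC estimate, Rao-Blackwellized estimate)."""
    prev, mc, rb = None, 0.0, 0.0
    for _ in range(LENGTH):
        p_dist, q_dist = p[prev], q[prev]
        rb += step_kl(p_dist, q_dist)  # exact expectation over the next token
        tok = rng.choices(range(VOCAB), weights=p_dist)[0]
        mc += math.log(p_dist[tok] / q_dist[tok])  # sampled token only
        prev = tok
    return mc, rb


def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


rng = random.Random(42)
pairs = [sample_estimates(rng) for _ in range(20000)]
mc_mean, mc_var = mean_var([a for a, _ in pairs])
rb_mean, rb_var = mean_var([b for _, b in pairs])
print(f"MC: mean={mc_mean:.4f} var={mc_var:.4f}")
print(f"RB: mean={rb_mean:.4f} var={rb_var:.4f}")
```

Both estimators target the same quantity, so their means should agree up to sampling noise. In this toy setup the per-sample Monte Carlo values can be negative, while every Rao-Blackwellized value is a sum of exact KLs and therefore non-negative, and its variance is much smaller.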


Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark

Abstract

arXiv:2504.14693v2 Announce Type: replace-cross Abstract: Recent advancements in language multimodal models (LMMs) for video have demonstrated their potential for understanding video content, yet the task of comprehending multi-discipline lectures remains largely unexplored. We introduce Video-MMLU, a massive benchmark designed to evaluate the capabilities of LMMs in understanding Multi-Discipline Lectures. We evaluate over 90 open-source and proprietary models, ranging from 0.5B to 40B parameters. Our results highlight the limitations of current models in addressing the cognitive challenges presented by these lectures, especially in tasks requiring both perception and reasoning. Additionally, we explore how the number of visual tokens and the large language models influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.

摘要

近期视频语言多模态模型(LMMs)的进展展现了其在理解视频内容方面的潜力,但对多学科讲座的理解任务仍鲜有研究。我们提出Video-MMLU——一个用于评估LMMs理解多学科讲座能力的大规模基准测试。我们对超过90个开源和专有模型进行了评估,参数规模从0.5B到40B不等。研究结果凸显了当前模型在处理这类讲座所呈现的认知挑战时的局限性,尤其是在需要感知与推理协同的任务中。此外,我们探究了视觉标记数量与大语言模型对性能的影响,为多模态感知与推理在讲座理解中的相互作用提供了新见解。


Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs

Abstract

arXiv:2504.15210v2 Announce Type: replace-cross Abstract: Code-generating Large Language Models (LLMs) have become essential tools in modern software development, enhancing productivity and accelerating development. This paper aims to investigate the fine-tuning of code-generating LLMs using Reinforcement Learning and Direct Preference Optimization, further improving their performance. To achieve this, we enhance the training data for the reward model with the help of symbolic execution techniques, ensuring more comprehensive and objective data. With symbolic execution, we create a custom dataset that better captures the nuances in code evaluation. Our reward models, fine-tuned on this dataset, demonstrate significant improvements over the baseline, CodeRL, in estimating the quality of generated code. Our code-generating LLMs, trained with the help of reward model feedback, achieve similar results compared to the CodeRL benchmark.

摘要

代码生成大语言模型(LLMs)已成为现代软件开发的重要工具,能够提升生产力并加速开发进程。本文旨在研究通过强化学习和直接偏好优化对代码生成LLMs进行微调,以进一步提升其性能。为实现这一目标,我们借助符号执行技术增强奖励模型的训练数据,确保数据更全面、客观。通过符号执行,我们构建了一个能更好捕捉代码评估细微差别的定制数据集。基于该数据集微调的奖励模型,在评估生成代码质量方面较基线模型CodeRL展现出显著改进。通过奖励模型反馈训练的代码生成LLMs,取得了与CodeRL基准相当的成果。


FairTranslate: An English-French Dataset for Gender Bias Evaluation in Machine Translation by Overcoming Gender Binarity

Abstract

arXiv:2504.15941v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly leveraged for translation tasks but often fall short when translating inclusive language -- such as texts containing the singular 'they' pronoun or otherwise reflecting fair linguistic protocols. Because these challenges span both computational and societal domains, it is imperative to critically evaluate how well LLMs handle inclusive translation with a well-founded framework. This paper presents FairTranslate, a novel, fully human-annotated dataset designed to evaluate non-binary gender biases in machine translation systems from English to French. FairTranslate includes 2418 English-French sentence pairs related to occupations, annotated with rich metadata such as the stereotypical alignment of the occupation, grammatical gender indicator ambiguity, and the ground-truth gender label (male, female, or inclusive). We evaluate four leading LLMs (Gemma2-2B, Mistral-7B, Llama3.1-8B, Llama3.3-70B) on this dataset under different prompting procedures. Our results reveal substantial biases in gender representation across LLMs, highlighting persistent challenges in achieving equitable outcomes in machine translation. These findings underscore the need for focused strategies and interventions aimed at ensuring fair and inclusive language usage in LLM-based translation systems. We make the FairTranslate dataset publicly available on Hugging Face, and disclose the code for all experiments on GitHub.

摘要

大型语言模型(LLMs)在翻译任务中的应用日益广泛,但在处理包容性语言(如包含单数"they"代词或体现公平语言规范的文本)时表现欠佳。由于这些挑战横跨计算与社会领域,亟需通过科学框架严格评估LLMs处理包容性翻译的能力。

本文提出FairTranslate,这是一个新颖的、全人工标注的数据集,用于评估英译法机器翻译系统中的非二元性别偏见。该数据集包含2418组英法对照的职业相关句对,标注了职业的刻板印象关联度、语法性别指示模糊度及真实性别标签(男性/女性/包容性)等丰富元数据。

我们采用不同提示策略评估了四种主流LLMs(Gemma2-2B、Mistral-7B、Llama3.1-8B、Llama3.3-70B)在该数据集上的表现。结果显示所有模型均存在显著的性别表征偏差,凸显了机器翻译实现公平输出的持续挑战。这些发现表明,必须制定针对性策略和干预措施来确保基于LLMs的翻译系统实现公平包容的语言使用。

FairTranslate数据集已在Hugging Face平台公开,所有实验代码发布于GitHub。


VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?

Abstract

arXiv:2504.19267v2 Announce Type: replace-cross Abstract: Visual storytelling is an interdisciplinary field combining computer vision and natural language processing to generate cohesive narratives from sequences of images. This paper presents a novel approach that leverages recent advancements in multimodal models, specifically adapting transformer-based architectures and large multimodal models, for the visual storytelling task. Leveraging the large-scale Visual Storytelling (VIST) dataset, our VIST-GPT model produces visually grounded, contextually appropriate narratives. We address the limitations of traditional evaluation metrics, such as BLEU, METEOR, ROUGE, and CIDEr, which are not suitable for this task. Instead, we utilize RoViST and GROOVIST, novel reference-free metrics designed to assess visual storytelling, focusing on visual grounding, coherence, and non-redundancy. These metrics provide a more nuanced evaluation of narrative quality, aligning closely with human judgment.

摘要

视觉叙事是一个融合计算机视觉与自然语言处理的跨学科领域,旨在从图像序列中生成连贯的叙述。本文提出了一种创新方法,通过整合多模态模型的最新进展(特别是基于Transformer的架构和大型多模态模型)来完成视觉叙事任务。基于大规模视觉叙事数据集VIST,我们开发的VIST-GPT模型能够生成视觉关联性强、上下文契合的叙述内容。针对BLEU、METEOR、ROUGE和CIDEr等传统评估指标在此任务中的局限性,我们采用了专为视觉叙事设计的新型无参考评估指标RoViST和GROOVIST,重点考察视觉基础性、连贯性和非冗余性。这些指标能更精细地评估叙述质量,其评价结果与人类判断具有更高一致性。


ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies

Abstract

arXiv:2504.20117v2 Announce Type: replace-cross Abstract: In this paper we introduce ResearchCodeAgent, a novel multi-agent system leveraging large language model (LLM) agents to automate the codification of research methodologies described in machine learning literature. The system bridges the gap between high-level research concepts and their practical implementation, allowing researchers to auto-generate code for existing research papers, either for benchmarking or for building on top of existing methods specified in the literature, given partial or complete starter code. ResearchCodeAgent employs a flexible agent architecture with a comprehensive action suite, enabling context-aware interactions with the research environment. The system incorporates a dynamic planning mechanism, utilizing both short and long-term memory to adapt its approach iteratively. We evaluate ResearchCodeAgent on three distinct machine learning tasks with distinct task complexity and representing different parts of the ML pipeline: data augmentation, optimization, and data batching. Our results demonstrate the system's effectiveness and generalizability, with 46.9% of generated code being high-quality and error-free, and 25% showing performance improvements over baseline implementations. Empirical analysis shows an average reduction of 57.9% in coding time compared to manual implementation. We observe higher gains for more complex tasks. ResearchCodeAgent represents a significant step towards automating the research implementation process, potentially accelerating the pace of machine learning research.

摘要

本文提出ResearchCodeAgent,一种基于大语言模型(LLM)代理的新型多代理系统,用于自动化实现机器学习文献中描述的研究方法代码化。该系统在高层研究概念与实际实现之间架设桥梁,使研究人员能够在已有部分或完整初始代码的情况下,自动生成现有论文的代码用于基准测试,或在文献指定方法的基础上继续开发。ResearchCodeAgent采用具有完整动作套件的灵活代理架构,实现与研究环境的上下文感知交互,并通过结合短期与长期记忆的动态规划机制迭代调整实现策略。我们在代表机器学习流程不同环节、具有不同复杂度的三个任务(数据增强、优化算法和数据批处理)上评估系统性能。结果表明:46.9%的生成代码具有高质量且无错误,25%的代码性能超越基线实现,编码时间较人工实现平均减少57.9%(复杂任务增益更显著)。该系统为实现研究过程自动化迈出重要一步,有望加速机器学习研究进程。


CarbonCall: Sustainability-Aware Function Calling for Large Language Models on Edge Devices

Abstract

arXiv:2504.20348v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) enable real-time function calling in edge AI systems but introduce significant computational overhead, leading to high power consumption and carbon emissions. Existing methods optimize for performance while neglecting sustainability, making them inefficient for energy-constrained environments. We introduce CarbonCall, a sustainability-aware function-calling framework that integrates dynamic tool selection, carbon-aware execution, and quantized LLM adaptation. CarbonCall adjusts power thresholds based on real-time carbon intensity forecasts and switches between model variants to sustain high tokens-per-second throughput under power constraints. Experiments on an NVIDIA Jetson AGX Orin show that CarbonCall reduces carbon emissions by up to 52%, power consumption by 30%, and execution time by 30%, while maintaining high efficiency.

摘要

大型语言模型(LLMs)虽能实现边缘AI系统中的实时函数调用,却会带来显著的计算开销,导致高功耗与碳排放。现有方法仅优化性能而忽视可持续性,使其在能源受限环境中效率低下。我们提出CarbonCall,一个具备可持续性意识的函数调用框架,整合了动态工具选择、碳感知执行和量化LLM适配三项关键技术。该框架根据实时碳强度预测调整功率阈值,并通过切换模型变体在功率约束下维持高吞吐量(每秒处理令牌数)。基于NVIDIA Jetson AGX Orin平台的实验表明,CarbonCall在保持高效运行的同时,最高可降低52%的碳排放,并使功耗减少30%、执行时间缩短30%。


Token-Efficient RL for LLM Reasoning

Abstract

arXiv:2504.20834v2 Announce Type: replace-cross Abstract: We propose reinforcement learning (RL) strategies tailored for reasoning in large language models (LLMs) under strict memory and compute limits, with a particular focus on compatibility with LoRA fine-tuning. Rather than relying on full-sequence updates or separate critic networks, we design critic-free methods that operate on a small, informative subset of output tokens to reduce memory usage and stabilize training. We introduce S-GRPO, a stochastic variant of Group Relative Policy Optimization, and T-SPMO, a token-level prefix matching approach for fine-grained credit assignment. Applied to Qwen2-1.5B, our methods raise accuracy on the SVAMP benchmark from 46% to over 70% and show strong performance on multi-digit multiplication. Surprisingly, full-token GRPO under LoRA fails to improve over the base model, suggesting that selective token-level optimization may act as an implicit regularizer in low-parameter training regimes.

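Summary

We propose reinforcement learning (RL) strategies designed for large language model (LLM) reasoning under strict memory and compute limits, with particular attention to compatibility with LoRA fine-tuning. Instead of relying on full-sequence updates or a separate critic network, we design critic-free methods that operate on a small, informative subset of output tokens to reduce memory usage and stabilize training. Specifically, we introduce S-GRPO, a stochastic variant of Group Relative Policy Optimization, and T-SPMO, a token-level prefix matching approach for fine-grained credit assignment. On Qwen2-1.5B, our methods raise accuracy on the SVAMP benchmark from 46% to over 70% and perform strongly on multi-digit multiplication. Notably, full-token GRPO under LoRA fails to improve over the base model, which suggests that selective token-level optimization may act as an implicit regularizer in low-parameter training regimes.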
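To make the idea of critic-free updates on a token subset concrete, here is a simplified sketch. It uses the standard group-relative advantage (reward minus group mean, divided by group standard deviation) and then restricts a plain policy-gradient surrogate to randomly kept tokens. The uniform sampling rule and `keep_prob` are assumptions, since the abstract does not state how S-GRPO selects tokens, and the clipped probability ratio and KL term of full GRPO are omitted.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sample_token_mask(length, keep_prob, rng):
    """Keep each token position independently (assumed sampling rule)."""
    mask = [rng.random() < keep_prob for _ in range(length)]
    if length > 0 and not any(mask):
        # Guarantee at least one token contributes to the loss.
        mask[rng.randrange(length)] = True
    return mask


def masked_pg_loss(logprobs, advantages, masks):
    """Policy-gradient surrogate averaged over the kept tokens only.

    logprobs[i][t] is the log-probability of token t in completion i.
    """
    total, count = 0.0, 0
    for lp_seq, adv, mask in zip(logprobs, advantages, masks):
        for lp, keep in zip(lp_seq, mask):
            if keep:
                total += -adv * lp
                count += 1
    return total / max(count, 1)


if __name__ == "__main__":
    rng = random.Random(0)
    rewards = [1.0, 0.0, 0.0, 1.0]            # e.g. answer correct or not
    logprobs = [[-0.5, -1.0, -0.2]] * 4       # toy per-token log-probs
    adv = group_relative_advantages(rewards)
    masks = [sample_token_mask(3, 0.5, rng) for _ in rewards]
    print(masked_pg_loss(logprobs, adv, masks))
```

Because only the kept tokens enter the loss, gradients need to be stored for a fraction of the sequence, which is where the memory saving in this sketch would come from.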


Pretraining Large Brain Language Model for Active BCI: Silent Speech

Abstract

arXiv:2504.21214v2 Announce Type: replace-cross Abstract: This paper explores silent speech decoding in active brain-computer interface (BCI) systems, which offer more natural and flexible communication than traditional BCI applications. We collected a new silent speech dataset of over 120 hours of electroencephalogram (EEG) recordings from 12 subjects, capturing 24 commonly used English words for language model pretraining and decoding. Following the recent success of pretraining large models with self-supervised paradigms to enhance EEG classification performance, we propose the Large Brain Language Model (LBLM), pretrained to decode silent speech for active BCI. To pretrain LBLM, we propose the Future Spectro-Temporal Prediction (FSTP) pretraining paradigm, which learns effective representations from unlabeled EEG data. Unlike existing EEG pretraining methods, which mainly follow a masked-reconstruction paradigm, FSTP employs autoregressive modeling in the temporal and frequency domains to capture both temporal and spectral dependencies in EEG signals. After pretraining, we finetune LBLM on downstream tasks, including word-level and semantic-level classification. Extensive experiments demonstrate significant performance gains of LBLM over fully-supervised and pretrained baseline models. For instance, in the difficult cross-session setting, our model achieves 47.0% accuracy on semantic-level classification and 39.6% on word-level classification, outperforming baseline methods by 5.4% and 7.3%, respectively. Our research advances silent speech decoding in active BCI systems, offering an innovative solution for EEG language model pretraining and a new dataset for fundamental research.

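Summary

This paper explores silent speech decoding in active brain-computer interface (BCI) systems, which enable more natural and flexible communication than traditional BCI applications. We collected a new dataset of over 120 hours of electroencephalogram (EEG) recordings from 12 subjects, covering 24 commonly used English words, for language model pretraining and decoding. Building on the recent success of self-supervised pretraining of large models for EEG classification, we propose LBLM, a large model for silent speech decoding designed for active BCI. For pretraining, we propose the Future Spectro-Temporal Prediction (FSTP) paradigm, which learns effective representations from unlabeled EEG data. Unlike existing EEG pretraining methods, which mainly follow a masked-reconstruction paradigm, FSTP uses autoregressive modeling in the temporal and frequency domains to capture both the temporal and the spectral dependencies of EEG signals. After pretraining, we finetune LBLM on downstream tasks such as word-level and semantic-level classification. Extensive experiments show that LBLM achieves significant gains over both fully-supervised and pretrained baseline models. For example, in the challenging cross-session setting, the model reaches 47.0% accuracy on semantic-level classification and 39.6% on word-level classification, outperforming baseline methods by 5.4% and 7.3%, respectively. This work advances silent speech decoding in active BCI systems, offering an innovative solution for EEG language model pretraining and contributing a new dataset for fundamental research.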


Improving Phishing Email Detection Performance of Small Large Language Models

Abstract

arXiv:2505.00034v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have demonstrated remarkable performance on many natural language processing (NLP) tasks and have been employed in phishing email detection research. However, in current studies, well-performing LLMs typically contain billions or even tens of billions of parameters, requiring enormous computational resources. To reduce computational costs, we investigated the effectiveness of small-parameter LLMs for phishing email detection. These LLMs have around 3 billion parameters and can run on consumer-grade GPUs. However, small LLMs often perform poorly on phishing email detection tasks. To address this issue, we designed a set of methods, including Prompt Engineering, Explanation Augmented Fine-tuning, and Model Ensemble, to improve the phishing email detection capabilities of small LLMs. We validated the effectiveness of our approach through experiments, significantly improving both accuracy and F1 score on the SpamAssassin and CEAS_08 datasets. Furthermore, the fine-tuned models demonstrated strong transferability, achieving robust performance across multiple unseen phishing datasets, outperforming traditional baselines and approaching standard-sized LLMs.

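Summary

Large language models (LLMs) have shown remarkable performance on many natural language processing (NLP) tasks and have been applied to phishing email detection research. However, the well-performing LLMs in current studies typically contain billions or even tens of billions of parameters and consume enormous computational resources. To reduce computational costs, we investigated the effectiveness of small-parameter LLMs for phishing email detection. These LLMs have only about 3 billion parameters and can run on consumer-grade GPUs, but they often perform poorly on phishing email detection tasks. To address this problem, we designed a set of methods comprising prompt engineering, explanation-augmented fine-tuning, and model ensembling to improve the phishing email detection capabilities of small LLMs. In experiments, our methods significantly improved accuracy and F1 score on the SpamAssassin and CEAS_08 datasets. In addition, the fine-tuned models transfer well: they achieve robust performance on multiple unseen phishing datasets, outperform traditional baselines, and approach the level of standard-sized LLMs.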
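As an illustration of the model-ensemble step mentioned in this entry, the sketch below combines per-model labels by majority vote. The abstract does not say which ensemble rule the authors use, so majority voting, the label strings, and the tie-break toward "phishing" are all assumptions.

```python
from collections import Counter


def majority_vote(labels, tie_break="phishing"):
    """Return the most common label; break ties toward `tie_break`.

    `labels` holds one predicted label per model (hypothetical strings).
    """
    if not labels:
        raise ValueError("need at least one prediction")
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        # Assumed policy: when models disagree evenly, flag the email.
        return tie_break
    return ranked[0][0]


if __name__ == "__main__":
    print(majority_vote(["phishing", "legitimate", "phishing"]))
```

Breaking ties toward the positive class trades some false alarms for fewer missed phishing emails; the opposite policy is equally easy to configure through `tie_break`.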


CoordField: Coordination Field for Agentic UAV Task Allocation In Low-altitude Urban Scenarios

Abstract

arXiv:2505.00091v2 Announce Type: replace-cross Abstract: With the increasing demand for heterogeneous Unmanned Aerial Vehicle (UAV) swarms to perform complex tasks in urban environments, system design now faces major challenges, including efficient semantic understanding, flexible task planning, and the ability to dynamically adjust coordination strategies in response to evolving environmental conditions and continuously changing task requirements. To address the limitations of existing approaches, this paper proposes a coordination field agentic system for coordinating heterogeneous UAV swarms in complex urban scenarios. In this system, large language models (LLMs) are responsible for interpreting high-level human instructions and converting them into executable commands for the UAV swarms, such as patrol and target tracking. Subsequently, a coordination field mechanism is proposed to guide UAV motion and task selection, enabling decentralized and adaptive allocation of emergent tasks. A total of 50 rounds of comparative testing were conducted across different models in a 2D simulation space to evaluate their performance. Experimental results demonstrate that the proposed system achieves superior performance in terms of task coverage, response time, and adaptability to dynamic changes.

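Summary

As demand grows for heterogeneous unmanned aerial vehicle (UAV) swarms that perform complex tasks in urban environments, system design faces major challenges: efficient semantic understanding, flexible task planning, and the ability to adjust coordination strategies as the environment and task requirements change. To overcome the limitations of existing approaches, this paper proposes a coordination field agentic system for heterogeneous UAV swarms in complex urban scenarios. The system uses large language models (LLMs) to interpret high-level human instructions and convert them into executable commands such as patrol and target tracking. A coordination field mechanism then guides UAV motion and task selection, enabling decentralized, adaptive allocation of emergent tasks. Fifty rounds of comparative testing across different models were conducted in a 2D simulation space, and the results show that the proposed system performs better in task coverage, response time, and adaptability to dynamic changes.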


The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)

Abstract

arXiv:2505.00626v2 Announce Type: replace-cross Abstract: Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role -- a concept we call role separation -- is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine role-separation learning: the process of teaching LLMs to robustly distinguish system and user tokens. Through a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation can partially mitigate these shortcuts, it generally leads to iterative patching rather than a deeper fix. To address this, we propose reinforcing invariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, manipulating position IDs helps the model learn clearer distinctions and reduces reliance on superficial proxies. By focusing on this mechanism-centered perspective, our work illuminates how LLMs can more reliably maintain consistent multi-role behavior without merely memorizing known prompts or triggers.

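Summary

Large language models (LLMs) that integrate multiple input roles (such as system instructions, user queries, and external tool outputs) are increasingly common in practice. Ensuring that the model accurately distinguishes messages from different roles, which we call role separation, is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether these methods truly teach LLMs to differentiate roles or merely to memorize known trigger patterns. This paper studies role-separation learning: the process of teaching LLMs to robustly distinguish system tokens from user tokens. Using a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation and (2) proximity to the beginning of the text. Data augmentation can partially mitigate these shortcuts, but it usually leads to iterative patching rather than a fundamental fix. We therefore propose reinforcing the invariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, manipulating position IDs helps the model learn clearer distinctions and reduces its reliance on superficial proxies. By taking this mechanism-centered perspective, our work shows how LLMs can maintain consistent multi-role behavior more reliably, rather than merely memorizing known prompts or trigger patterns.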
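One way to picture "manipulating position IDs" is to insert a fixed offset between the positions of system tokens and user tokens, so that the role boundary is visible in the input encoding itself. The sketch below is only an illustration of that idea: the function name, the gap size, and the choice of a single constant gap are assumptions, and the abstract does not specify the exact manipulation used.

```python
def position_ids_with_gap(n_system: int, n_user: int, gap: int) -> list:
    """Build position IDs with `gap` unused positions between the two roles.

    With gap=0 this reduces to the usual contiguous 0..n-1 numbering.
    """
    if min(n_system, n_user, gap) < 0:
        raise ValueError("lengths and gap must be non-negative")
    system_ids = list(range(n_system))
    user_start = n_system + gap
    user_ids = list(range(user_start, user_start + n_user))
    return system_ids + user_ids


if __name__ == "__main__":
    # 3 system tokens, 2 user tokens, 10 skipped positions (assumed value).
    print(position_ids_with_gap(3, 2, 10))
```

The resulting list would be passed to a model in place of its default position IDs; the jump in numbering gives every user token a position signal that does not depend on task type or on distance from the start of the text.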


Large Language Models Understanding: an Inherent Ambiguity Barrier

Abstract

arXiv:2505.00654v2 Announce Type: replace-cross Abstract: Since the extraordinary emergence of Large Language Models (LLMs), a lively debate has been taking place regarding their capability to understand the world and capture the meaning of the dialogues in which they are involved. Arguments and counter-arguments have been proposed based upon thought experiments, anecdotal conversations between LLMs and humans, statistical linguistic analysis, philosophical considerations, and more. In this brief paper we present a counter-argument based upon a thought experiment and semi-formal considerations leading to an inherent ambiguity barrier which prevents LLMs from having any understanding of what their amazingly fluent dialogues mean.

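Summary

Since large language models (LLMs) first displayed their extraordinary abilities, there has been an ongoing and heated debate over whether they can understand the world and capture the meaning of the dialogues they take part in. Researchers have offered arguments for and against based on thought experiments, anecdotal conversations between LLMs and humans, statistical linguistic analysis, philosophical considerations, and more. Using a thought experiment and semi-formal reasoning, this paper presents a counter-argument: LLMs face an inherent ambiguity barrier, a fundamental obstacle that prevents them from understanding what their fluent dialogues mean.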


Helping Large Language Models Protect Themselves: An Enhanced Filtering and Summarization System

Abstract

arXiv:2505.01315v2 Announce Type: replace-cross Abstract: The recent growth in the use of Large Language Models (LLMs) has exposed them to sophisticated adversarial assaults, manipulative prompts, and encoded malicious inputs. Existing countermeasures frequently necessitate retraining models, which is computationally costly and impracticable for deployment. Without the need for retraining or fine-tuning, this study presents a defense paradigm that allows LLMs to recognize, filter, and defend against adversarial or malicious inputs on their own. There are two main parts to the suggested framework: (1) a prompt filtering module that uses Natural Language Processing (NLP) techniques, including zero-shot classification, keyword analysis, and encoded content detection (e.g. base64, hexadecimal, URL encoding), to detect, decode, and classify harmful inputs; and (2) a summarization module that processes and summarizes adversarial research literature to give the LLM context-aware defense knowledge. This approach strengthens LLMs' resistance to adversarial exploitation by fusing text extraction, summarization, and harmful prompt analysis. According to experimental results, this integrated technique has a 98.71% success rate in identifying harmful patterns, manipulative language structures, and encoded prompts. By employing a modest amount of adversarial research literature as context, the methodology also allows the model to react correctly to harmful inputs, with higher jailbreak resistance and refusal rates. While maintaining the quality of LLM responses, the framework dramatically increases LLMs' resistance to hostile misuse, demonstrating its efficacy as a quick and easy substitute for time-consuming, retraining-based defenses.

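Summary

The surge in the use of large language models (LLMs) has exposed them to sophisticated adversarial attacks, manipulative prompts, and encoded malicious inputs. Existing defenses usually require retraining the model, which is computationally expensive and hard to deploy. This study proposes a defense paradigm that needs no retraining or fine-tuning and lets LLMs recognize, filter, and resist adversarial or malicious inputs on their own. The framework has two core components: (1) a prompt filtering module that uses natural language processing techniques, including zero-shot classification, keyword analysis, and detection of encoded content such as base64, hexadecimal, and URL encoding, to detect, decode, and classify harmful inputs; and (2) a summarization module that processes and summarizes adversarial research literature to give the LLM context-aware defense knowledge. By combining text extraction, summarization, and harmful prompt analysis, the approach strengthens LLMs' resistance to adversarial exploitation. Experimental results show that the integrated technique achieves a 98.71% success rate in identifying harmful patterns, manipulative language structures, and encoded prompts. With a small amount of adversarial research literature as context, the method also enables the model to respond correctly to malicious inputs with higher jailbreak resistance and refusal rates. The framework preserves LLM response quality while greatly increasing resistance to malicious misuse, which shows that it can serve as an efficient alternative to time-consuming, retraining-based defenses.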
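The encoded content detection step of the prompt filtering module can be sketched with the Python standard library alone. This is an illustrative approximation, not the authors' code: the regular expressions, the `min_len` cutoff, the "decodes to printable UTF-8" heuristic, and the one-entry `BLOCKLIST` are all assumptions, and the zero-shot classifier mentioned in the abstract is replaced here by a plain keyword check.

```python
import base64
import binascii
import re
from urllib.parse import unquote

BASE64_RE = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")
HEX_RE = re.compile(r"^(?:[0-9a-fA-F]{2})+$")
URL_RE = re.compile(r"%[0-9a-fA-F]{2}")

# Hypothetical keyword list standing in for a real harmful-prompt classifier.
BLOCKLIST = ("ignore previous instructions",)


def _printable(data: bytes):
    """Return decoded text if the bytes are printable UTF-8, else None."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return text if text.isprintable() else None


def decode_token(token: str, min_len: int = 8):
    """Return (encoding, decoded_text) if the token looks encoded, else None."""
    if len(token) >= min_len and HEX_RE.match(token):
        text = _printable(bytes.fromhex(token))
        if text is not None:
            return "hex", text
    if len(token) >= min_len and len(token) % 4 == 0 and BASE64_RE.match(token):
        try:
            raw = base64.b64decode(token, validate=True)
        except (binascii.Error, ValueError):
            raw = None
        text = _printable(raw) if raw is not None else None
        if text is not None:
            return "base64", text
    if URL_RE.search(token):
        return "url", unquote(token)
    return None


def flag_prompt(prompt: str):
    """Decode each whitespace-separated token and mark blocklisted content."""
    findings = []
    for token in prompt.split():
        decoded = decode_token(token)
        if decoded is None:
            continue
        kind, text = decoded
        harmful = any(k in text.lower() for k in BLOCKLIST)
        findings.append((kind, text, harmful))
    return findings


if __name__ == "__main__":
    payload = base64.b64encode(b"ignore previous instructions").decode()
    print(flag_prompt("please run " + payload))
```

Heuristics like these can misfire on ordinary words that happen to be valid base64 or hex, which is one reason a filter of this kind would pair decoding with a downstream classifier rather than blocking on the decode result alone.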